METHOD FOR IMPLEMENTING THE CHINESE REMAINDER THEOREM



Background of the Invention 

The present application is directed to a method and apparatus for performing encryption
and decryption. The application discloses several inventions relating to an overall system for the
use of exponentiation modulo N as a mechanism for carrying out the desired cryptological goals
and functions in a rapid, efficient, accurate and reliable manner. A first part of the disclosure is
related to the construction of a method and its associated apparatus for carrying out modular
multiplication. A second part of the disclosure is directed to an improved apparatus for carrying
out modular multiplication through the partitioning of the problem into more manageable pieces,
which results in the construction of individual, identical (if so desired) Processing Elements. A
third part of the disclosure is directed to the utilization of the resulting series of Processing
Elements in a pipelined fashion for increased speed and throughput. A fourth part of the
disclosure is directed to an apparatus and method for calculating a unique inverse operation that
is desirable as an input step or stage to the modular multiplication operation. A fifth part of the
disclosure is directed to the use of the modular multiplication system described herein in its
originally intended function of performing an exponentiation operation. A sixth part of the
disclosure is directed to the use of the Chinese Remainder Theorem in conjunction with the
exponentiation operation. A seventh part of this disclosure is directed to the construction and
utilization of checksum circuitry which is employed to ensure reliable and accurate operation of
the entire system. The present application is particularly directed to the invention described in the
sixth part of the disclosure.

More particularly, the present invention is directed to circuits, systems and methods for
multiplying two binary numbers having up to n bits each with the multiplication being modulo N,
an odd number. In particular, the present invention partitions one of the factors into m blocks
with k bits in each block with the natural constraint that mk ≥ n + 2. Even more particularly, the
present invention is directed to multiplication modulo N when the factors being multiplied have
a large number of bits. The present invention is also particularly directed to the use of the
modular multiplication function hardware described herein in the calculation of a modular
exponentiation function for use in cryptography. Ancillary functions, such as the calculation of a
convenient inverse and a checksum mechanism for the entire apparatus, are also provided herein.
The partitioning employed herein also results in the construction of Processing Elements which
can be cascaded to provide significant expansion capabilities for larger values of N. This, in turn,
leads to a modality of Processing Element use in a pipelined fashion. The cascade of Processing
Elements is also advantageously controllable so as to effectively partition the Processing Element
chain into separate pieces which independently work on distinct and separate factors of N.

Those wishing an optimal understanding from this disclosure should appreciate at the
outset that the purpose of the methods and circuits shown herein is the performance of certain
arithmetic functions needed in modern cryptography and that these operations are not standard
multiplication, inversion and/or exponentiation, but rather are modulo N operations. The fact
that the present application is directed to modular arithmetic circuits and methods, as opposed to
standard arithmetic operations, is a fact which would be best to keep firmly in mind, particularly
since modular arithmetic, with its implied division operations, is much more difficult to perform
and to calculate, particularly where exponentiation modulo N is involved.

In a preferred system for implementation which takes advantage of certain aspects of the
present invention, this application is also directed to a circuit and method of practice in which an
adder array and a multiplier array are effectively partitioned into a series of nearly identical
processor elements with each processor element (PE) in the series operating on a sub-block of
data. The multiplier array and adder array are thus partitioned. Having recognized the
ability to reconfigure the generic structure into a plurality of serially connected processor
elements, the present invention is also directed to a method of operation in which each processor
element operates as part of a pipeline over a plurality of operational cycles. The pipelining mode
of operation is even further extended to the multiplication of a series of numbers in a fashion in
which all of the processor elements are continuously and actively generating results.

The multiplication of binary numbers modulo N is an important operation in modern
public-key cryptography. The security of any cryptographic system which is based upon the
multiplication and subsequent factoring of large integers is directly related to the size of the
numbers employed, that is, the number of bits or digits in the number. For example, each of the
two multiplying factors may have up to 1,024 bits. However, for cryptographic purposes, it is
necessary to carry out this multiplication modulo a number N. Accordingly, it should be
understood that the multiplication considered herein multiplies two n bit numbers to produce a
result with n bits or less rather than the usual 2n bits in conventional multiplication.

However, even though there is a desire for inclusion of a large number of bits in each
factor, the speed of calculation becomes significantly slower as the number of digits or bits
increases. For real-time cryptographic purposes, the speeds of encryption and decryption are
important concerns. In particular, real-time cryptographic processing is a desirable result.

Different methods have been proposed for carrying out modular multiplication. In
particular, in an article appearing in "The Mathematics of Computation," Vol. 44, No. 170, April
1985, pp. 519-521, Peter L. Montgomery describes an algorithm for "Modular Multiplication
without Trial Division." However, this article describes operations that are impractical to
implement in hardware for a large value of N. Furthermore, the method described by
Montgomery operates only in a single phase. In contrast, the system and method presented
herein partitions operational cycles into two phases. From a hardware perspective, the
partitioning provides a mechanism for hardware sharing which provides significant advantages.

Summary of the Invention 

In accordance with a preferred embodiment of the present invention, an initial zero value
is stored in a result register Z_0. The integers A and B which are to be multiplied using the present
process are partitioned into m blocks with k bits in each block. The multiplication is carried out
modulo N. Additionally, the value R is set equal to 2^k. In this way, the integer A is representable
as A = A_{m-1} R^{m-1} + ... + A_2 R^2 + A_1 R + A_0. This is the partitioning of the integer A into m blocks.



In one embodiment of the present invention, a method and circuit are shown for
computing a function Z = f(A, B) = AB 2^{-mk} mod N. Later, it will be shown how this function is
used to calculate AB mod N itself.

The system, methods, and circuits of the present invention are best understood in the
context of the underlying algorithm employed. Furthermore, for purposes of understanding this
algorithm, it is noted that modular computation is carried out modulo N, which is an odd number,
and n is the number of bits in the binary representation of N. Additionally, N_0 represents the least
significant k bits of N. Also, a constant s is employed which is equal to -1/N_0 mod
R = 1/(R - N_0) mod R. With this convention, the algorithm is expressed in pseudo code as
follows:

    Z_0 = 0
    for i = 0 to m-1
        X_i = Z_i + A_i B
        y_i = s x_{i,0} mod R    (x_{i,0} is the least significant k bits of X_i)
        Z_{i+1} = (X_i + y_i N)/R
    end.
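
The loop above can be sketched in software as follows. This is a minimal Python sketch for illustration only, not the hardware of the invention; the function name block_montgomery and its parameter names are assumptions introduced here, and Python 3.8 or later is assumed for the modular inverse computed with pow. The returned value is congruent to AB 2^{-mk} modulo N but is not necessarily reduced below N.

```python
def block_montgomery(a, b, n_mod, k, m):
    """Return Z with Z ≡ a*b*2^(-m*k) (mod n_mod), following the pseudo code.

    Assumptions (illustrative): n_mod is odd, a < 2^(m*k), and a is the
    factor that is split into m blocks of k bits each.
    """
    r = 1 << k                      # R = 2^k
    mask = r - 1
    n0 = n_mod & mask               # N_0: least significant k bits of N
    s = (-pow(n0, -1, r)) % r       # s = -1/N_0 mod R
    z = 0                           # Z_0 = 0
    for i in range(m):
        a_i = (a >> (i * k)) & mask     # block A_i of the partitioned factor
        x = z + a_i * b                 # X-phase: X_i = Z_i + A_i B
        y = (s * (x & mask)) & mask     # y_i = s x_{i,0} mod R
        z = (x + y * n_mod) >> k        # Z-phase: low k bits are zero, so the shift is exact
    return z


if __name__ == "__main__":
    N, k, m = 239, 4, 3             # toy values: n = 8 bits, mk = 12 >= n + 2
    z = block_montgomery(200, 123, N, k, m)
    print(z % N == (200 * 123 * pow(2, -m * k, N)) % N)
```

The right shift by k bits in the last step of the loop is the software counterpart of dividing by R: the choice of y_i forces the low-order k bits of X_i + y_i N to zero.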



There are two items to note in particular about this method for carrying out modulo N
multiplication. The first thing to note is that the multiplication is based upon a partitioning of
one of the factors into sub-blocks with k bits in each block. This greatly reduces the size of the
multiplier arrays which need to be constructed. It furthermore creates a significant degree of
parallelism which permits the multiplication operation to be carried out in a much shorter period of
time. The second item to note is that the partitioning also results in the splitting of the process
not only into a plurality of m cycles, but also splits the method into two phases that occur in each
cycle. In the first phase (X-phase), the values X_i and y_i are computed. In the second phase
(Z-phase), the intermediate result value Z_{i+1} is calculated. It should be noted that, in the
calculation of X_i and in the calculation of Z_{i+1}, there is an addition operation and a multiplication
operation. This fact allows the same hardware which performs the multiplication and addition in
each of these steps to be shared rather than duplicated. With respect to the division by R in the
formation of Z_{i+1}, it is noted that this is accomplishable by simply discarding the low order k bits.
Other advantages of this structure will also become apparent.

The output of the above hardware and method produces the product AB 2^{-mk} mod N. To
produce the more desirable result AB mod N, the method and circuit employed above is used a
second time. In particular, the original output from this circuit is supplied to one of its input
registers with the other register containing the factor 2^{2mk} mod N. This factor eliminates the first
factor of 2^{-mk} added during the first calculation and also cancels the additional factor of 2^{-mk}
included when the circuit is run the second time. This produces the result AB mod N.
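
The two-pass use of the function described above can be sketched as follows. This Python sketch is illustrative only; the names f and modmul are assumptions introduced here, and the final reduction with the % operator is an added assumption, since the loop itself returns a value that is congruent to the desired result but may exceed N.

```python
def f(a, b, n_mod, k, m):
    """Z ≡ a*b*2^(-m*k) (mod n_mod); a is the factor split into m blocks of k bits."""
    r = 1 << k
    mask = r - 1
    s = (-pow(n_mod & mask, -1, r)) % r   # s = -1/N_0 mod R (Python 3.8+)
    z = 0
    for i in range(m):
        x = z + ((a >> (i * k)) & mask) * b   # X-phase
        y = (s * (x & mask)) & mask
        z = (x + y * n_mod) >> k              # Z-phase
    return z


def modmul(a, b, n_mod, k, m):
    """Return a*b mod n_mod using two passes through f."""
    c = pow(2, 2 * m * k, n_mod)     # the constant 2^(2mk) mod N
    z = f(a, b, n_mod, k, m)         # first pass:  AB 2^(-mk)
    z = f(z, c, n_mod, k, m)         # second pass: AB 2^(-mk) 2^(2mk) 2^(-mk) = AB
    return z % n_mod                 # final reduction (an assumption of this sketch)


if __name__ == "__main__":
    print(modmul(200, 123, 239, 4, 3) == (200 * 123) % 239)
```

In practice the constant 2^{2mk} mod N depends only on N, m and k, so it can be computed once and reused for every multiplication with the same modulus.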

For those who wish to practice the processes of the present invention via software, it is
noted that the algorithm for multiplication provided above is readily implementable in any
standard procedure-based programming language with the resulting code, in either source or
object form, being readily storable on any convenient storage medium, including, but certainly
not limited to, magnetic or optical disks. This process is also eminently exploitable along with
the use of the exponentiation processes described below, including processes for exponentiation
based on the Chinese Remainder Theorem.

In the process described above it is noted that one of the process inputs is the variable "s". 
This value is calculated as a negative inverse modulo R. In order to facilitate the generation of 
this input signal, a special circuit for its generation is described herein. This circuit also takes 
advantage of existing hardware used in other parts of a processing element. In particular, it 
forms a part of the rightmost processor element in a chain. 
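
For readers working in software, the value s can be obtained directly from a modular inverse. The following Python sketch is a software stand-in and does not model the special circuit mentioned above; the function name negative_inverse is an assumption introduced here, and Python 3.8 or later is assumed for pow with a negative exponent.

```python
def negative_inverse(n0, k):
    """Return s = -1/n0 mod 2^k, where n0 is the low-order k bits of an odd N."""
    r = 1 << k                       # R = 2^k
    if n0 % 2 == 0:
        raise ValueError("n0 must be odd for the inverse modulo 2^k to exist")
    return (-pow(n0, -1, r)) % r


if __name__ == "__main__":
    k = 32
    n0 = 0x89ABCDEF                  # arbitrary odd 32-bit value, for illustration only
    s = negative_inverse(n0, k)
    r = 1 << k
    print((s * n0 + 1) % r == 0)            # s * N_0 ≡ -1 (mod R)
    print(s == pow(r - n0, -1, r))          # same value as 1/(R - N_0) mod R
```

The second check reflects the identity -1/N_0 mod R = 1/(R - N_0) mod R, which holds because R - N_0 is congruent to -N_0 modulo R.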

Note that, in the calculation shown above for X_i and Z_i, these are more than n bit
numbers. Accordingly, the multiplication and addition operations are carried out in relatively
large circuits which are referred to herein as multiplier and adder arrays. In accordance with a
preferred method of practicing the present invention, the adder array and multiplier array are split
into sub-blocks. While this partitioning of hardware may be done using any convenient number
of blocks, partitioning into blocks capable of processing k bits at a time is convenient. Thus, in
the preferred embodiment, instead of employing one large multiplier array for processing two
numbers having n + 1 bits and k bits, with n being much greater than k, a plurality of separate k
bit by k bit multipliers are employed. Additionally, it is noted that partitioning into processor
element sub-blocks, while useful in and of itself particularly for circuit layout efficiency, also
ultimately makes it possible to operate the circuit in several pipelined modes.

In a first pipelined mode, the circuit is operated through a plurality of cycles, m, in which
adjacent processor elements are operated in alternate phases. That is, in a first pipelined mode, if
a processor element is in the X-phase, its immediate neighbors are operating in the Z-phase, and
vice versa. In a second pipelined mode, the pipelined operation is continued but with new entries
in the input registers (A and B) which now are also preferably partitioned in the same manner as
the multiplier and adder arrays.

Since n is generally much greater than k (1,024 as compared to 32, for example) and since
carry propagation through adder stages can contribute significantly to processing delays, the
partitioning and pipelining together eliminate this source of circuit delay and the corresponding
dependence of circuit operation times on the significant parameter n whose size, in cryptographic
contexts, determines the difficulty of unwarranted code deciphering.

The pipelined circuit of the present invention is also particularly useful in carrying out
exponentiation modulo N, an operation that is also very useful in cryptographic applications.
Such an operation involves repeated multiplication operations. Accordingly, even though
pipelining may introduce an initial delay, significant improvements in performance of
exponentiation operations are produced.

In one embodiment found within the disclosure herein it has been noted that the chaining 
together of individually operating Processing Elements introduces an addition operation in a 
critical timing path, that is, into a path whose delayed execution delays the whole process. The 
present invention provides an improvement in the design of the individual Processing Elements 
through the placement of this addition operation in an earlier portion of each Processing Element's
operation. In doing so, however, new control signals are also provided to make up for the fact
that some signals in some of the Processing Elements are not yet available at this earlier stage 
and accordingly are, where convenient, provided from operations occurring or which have 
already occurred in adjacent Processing Elements. 

The Processing Elements used herein are also specifically designed so that they may
function in different capacities. In particular, it is noted that the rightmost Processing Element
performs some operations that are unique to its position as the lower order Processing Element in
the chain. Likewise the leftmost element has a unique role and can assume a simpler form.
However, the Processing Elements employed herein are also specially designed and constructed
so as to be able to adapt to different roles in the chain. In particular, the middle Processing
Element is controllable so that it takes on the functional and operational characteristics of a
rightmost Processing Element. In this way the entire chain is partitionable so that it forms two
(or more, if needed) separate and independent chains operating (in preferred modalities) on
factors of the large odd integer N.

While an intermediate object of the present invention is the construction of a modular
multiplication engine, a more final goal is providing an apparatus for modular exponentiation. In
the present invention this is carried out using the disclosed modular multiplier in a repeated
fashion based on the binary representation of the exponent. A further improvement on this
process involves use of the Chinese Remainder Theorem for those parts of the exponentiation
operation in which the factors of N are known. The capability of the Processing Element chain
of the present invention to be partitioned into two portions is particularly useful here since each
portion of the controllably partitioned chain is able to work on each of the factors of N in an
independent and parallel manner.
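
The combination of binary exponentiation with the Chinese Remainder Theorem can be sketched in software as follows. This Python sketch is illustrative only and rests on stated assumptions: N = pq with p and q distinct primes, the base is coprime to N, the exponents are reduced modulo p - 1 and q - 1, and the two partial results are recombined with Garner's formula, which is a common choice and is not taken from this disclosure. The names modexp and crt_modexp are likewise assumptions introduced here.

```python
def modexp(base, exponent, modulus):
    """Left-to-right square-and-multiply driven by the binary digits of the exponent."""
    result = 1 % modulus
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus      # square for every exponent bit
        if bit == "1":
            result = (result * base) % modulus    # multiply when the bit is set
    return result


def crt_modexp(c, d, p, q):
    """Return c^d mod (p*q) from two half-size exponentiations (assumes gcd(c, p*q) = 1)."""
    # The two exponentiations below are independent of one another, which is
    # what allows two portions of a partitioned chain to work in parallel.
    m_p = modexp(c % p, d % (p - 1), p)
    m_q = modexp(c % q, d % (q - 1), q)
    # Garner recombination (an assumption of this sketch).
    q_inv = pow(q, -1, p)                         # Python 3.8+
    h = (q_inv * (m_p - m_q)) % p
    return m_q + h * q


if __name__ == "__main__":
    p, q, d = 61, 53, 2753                        # toy primes and exponent, for illustration
    c = 2790
    print(crt_modexp(c, d, p, q) == pow(c, d, p * q))
```

Each of the two exponentiations works with operands roughly half the length of N, which is the source of the speed advantage when the factors of N are known.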

Since one wishes to operate computational circuits at as high a speed as possible and
since this can sometimes lead to erroneous operations, there is a challenge in how to
detect errors when the operations being performed are not based on standard arithmetic, but are
rather based on modular arithmetic (addition, subtraction, inversion, multiplication and
exponentiation). However, the present invention solves this problem through the use of circuits
and methods which are not only consonant with the complicating requirements of modular
arithmetic operations but which are also capable of being generated on the fly with the addition
of only a very small amount of additional hardware and with no penalty in time of execution or
throughput.

Accordingly, it is seen that it is an object of the present invention to produce a multiplier
for multiplying two large integers modulo N.

It is yet another object of the present invention to improve the performance and
capabilities of cryptographic circuits and systems.

It is a still further object of the present invention to create a multiplier circuit which 
operates at high speed. 

It is yet another object of the present invention to create a multiplier circuit which
performs multiplication modulo N without having to perform division operations.

It is also an object of the present invention to provide a multiplier which is scaleable for
various values of N and n.

It is also another object of the present invention to provide a method for computing a
product of two integers modulo N in a multi-phase process which permits sharing of hardware
circuitry across the two phases.

It is yet another object of the present invention to provide a system and method in which 
the factors are partitioned into a plurality of m sub-blocks with each sub-block having k bits, 
whereby values for m and k are selectable so as to provide additional flexibility in hardware 
structure. 

It is also another object of the present invention to increase the speed of multiplication 
calculations in cryptographic processes. 

It is also an object of the present invention to provide an implementation for a multiplier
circuit which uses macro components as building blocks so as to avoid the costs associated with
custom design.

It is also an object of the present invention to provide a design which is flexible and
scaleable.

It is also an object of the present invention to provide a word-oriented, as opposed to a
bit-oriented, multiplication system and circuit.

It is a still further object of the present invention to construct a circuit for multiplication
modulo N which comprises a plurality of nearly identical processor elements.

It is yet another object of the present invention to partition the multiplication of an n bit
number into a plurality of pieces for quasi-independent calculation.

It is still another object of the present invention to operate the circuit herein in a pipelined
mode.

It is an even further object of the present invention to operate the circuit herein so as to 
process sequences of distinct operands (factors) in a pipelined mode. 

It is yet another object of the present invention to improve the performance of a sequence 
of chained Processing Elements by removing addition functions from critical paths. 

It is a still further object of the present invention to operate the circuit herein so as to 
process sequences of identical or repeated operands in a pipelined mode, as for example, in the 
calculation of the exponential function modulo N. 

It is yet another object of the present invention to increase the speed of exponentiation 
operations in cryptographic processes. 

It is a still further object of the present invention to provide Processing Elements whose
character as beginning, middle or end units in the chain may be controlled so as to enable the
partitioning of the chain into a plurality of sub-chains each of which is capable of independent
parallel processing based on a factor of N.

It is also an object of the present invention to provide a mechanism for calculating an 
inverse operation which is useful as an input to the method of modular multiplication employed 
herein. 

It is yet another object of the present invention to provide an apparatus and method for
generating checksums which are useful for indicating that the system has operated in a
proper fashion and has produced no errors.

It is a still further object of the present invention to provide a checksum circuit and
method which is consonant with modular arithmetic.

It is also an object of the present invention to provide an engine which is capable of data 
encryption through the use of exponentiation modulo N, a large prime or the product of two large 
primes. 

It is a further object of the present invention to provide an engine which is capable of data 
decryption through the use of exponentiation modulo N. 

It is yet another object of the present invention to employ the Chinese Remainder
Theorem to facilitate the exponentiation operation modulo N when factors for N are known.

It is also an object of the present invention to provide an encryption/decryption engine 
which is capable of operating in the mode of public key cryptographic systems. 

It is also an object of the present invention to provide an engine which is capable of 
generating and receiving documents having coded digital signatures. 

It is also an object of the present invention to provide an engine which is capable of
generating keys to be exchanged between any two users for data encryption and decryption.

It is also an object of the present invention to produce a high-speed, high-performance 
cryptographic engine. 

Lastly, but not limited hereto, it is an object of the present invention to provide a
cryptographic engine for encryption and for decryption which can be included as part of a larger
processing system and therefore possesses communication capabilities for the transfer of data and
command information from other parts of a larger scale data processing system with which the
present engine is coupled.

The recitation herein of a list of desirable objects which are met by various embodiments
of the present invention is not meant to imply or suggest that any or all of these objects are
present as essential features, either individually or collectively, in the most general embodiment
of the present invention or in any of its more specific embodiments.

Description of the Drawings 

The subject matter which is regarded as the invention is particularly pointed out and
distinctly claimed in the concluding portion of the specification. The invention, however, both as
to organization and method of practice, together with the further objects and advantages thereof,
may best be understood by reference to the following description taken in connection with the
accompanying drawings in which:

Figure 1 is a block diagram illustrating the circuits employed in the method and system
for multiplication modulo N described herein;

Figure 2 is a block diagram identical to Figure 1 except more particularly showing those
data flow paths which are active during a first or X-phase of calculation;

Figure 3 is a block diagram similar to Figures 1 and 2 except more particularly showing
those data flow paths which are active during the second or Z-phase of calculation;

Figure 4 is a block diagram of the rightmost processing element in a series of processing
elements in a partitioned embodiment of the circuit of Figure 1;

Figure 4A is a block diagram similar to Figure 4 but which illustrates an alternate
multiplier-to-adder connection;

Figure 5 is a block diagram illustrating one of a plurality of identical processing elements
which are configurable as a series of processor elements capable of performing the same
operation as the circuit shown in Figure 1;

Figure 5A is a block diagram similar to Figure 5 but which also illustrates an alternate
multiplier-to-adder connection;

Figure 6 is a block diagram illustrating the form of a processing element that could
expeditiously be employed as the last or leftmost processor element in a series of processor
elements for carrying out the same calculations as the circuit of Figure 1;

Figure 7 is a block diagram illustrating how the processor elements described in Figures
4, 5, and 6 are connected to produce the same results as the circuit shown in Figure 1;

Figure 8 is a block diagram illustrating the logical connection of processor elements over
time with particular reference to register storage and the X and Z phases of operation;

Figure 9 is a block diagram illustrating the use of processor elements in a pipelined 
fashion; 

Figure 10 is a block diagram illustrating a typical processor element as configured for use 
in a pipelining mode; 

Figure 11 is a block diagram similar to Figure 10 but more particularly illustrating a
processor element to be used in the rightmost or lower order position;

Figure 12 is a block diagram similar to Figure 8 but more particularly showing a variation
in the utilization of pipelining to speed up processing time by eliminating an adder from a critical
path;

Figure 13 is a block diagram illustrating an improved rightmost processor element in 
which an adder in a critical path has been moved to improve performance; 

Figure 14 is a block diagram similar to Figure 13 but more particularly illustrating a 
typical processor element for use in an improved pipeline operation; 

Figure 15 is a block diagram illustrating a preferred design for the leftmost processor
element in an improved pipelined configuration;

Figure 16 illustrates processor element utilization in pipelined operations; 

Figure 17 is a block diagram illustrating a circuit for calculating the negative modular
inverse of a number;

Figure 18 is a flow chart illustrating a method for using circuits which implement
modular multiplication in a fashion so as to further implement the exponentiation function;

Figure 19 is a flow chart similar to Figure 18 but exhibiting an alternative algorithm for
implementing a modular exponentiation function;

Figure 20 is a block diagram of a circuit for implementing either one of the algorithms
shown in Figures 18 or 19;

Figure 21 is a block diagram illustrating public key encryption and decryption processes,
particularly as they employ exponentiation operations, and more particularly illustrating the
presence of signal variables used for efficiency improvements;

Figure 22 is an overall block diagram view illustrating one embodiment of a 
cryptographic engine constructed in accordance with the present invention; 

Figure 23 is a block diagram illustrating the inclusion of a checksum mechanism 
consonant with a modulo N multiplication system; 

Figure 24 is a block diagram illustrating generically applicable circuits for generating
intermediate checksum values using modulo (R - 1) addition;

Figure 25 is a block diagram illustrating circuits for performing checksum operations
used in a final checksum comparison operation which provides error indications; and

Figure 26 is a block diagram illustrating circuits for generating checksum variables to be
compared using pairs of modulo (R - 1) adders.

Detailed Description of the Invention 

The structure and operation of the present invention is dependent upon the partitioning of
one of the multiplying factors into a plurality of k bit-wide pieces. Thus, instead of representing
a binary number A as \sum_i a_i 2^i, one of the multiplying factors in the present invention is
represented instead in the form A = A_{m-1} R^{m-1} + ... + A_2 R^2 + A_1 R + A_0 = \sum_{j=0}^{m-1} A_j R^j,
where R = 2^k. In this representation, the number A is represented in block form where each of the
m blocks includes k bits. That is, each A_i represents an integer having k bits.

In the present system, multiplication modulo an odd number N is a significant object.
Also, for purposes of understanding the present invention, the symbol n is used to denote the
number of bits in the binary representation for N. Also, for present purposes, it is assumed that
the number A, as stored in Register A (reference numeral 10 in Figure 1), is the number that is
partitioned into m blocks. In general, the number of blocks m is selected to be the smallest
integer for which mk ≥ n + 2. Additionally, it is understood that N_0 represents the least
significant k bits of the number N. Likewise, the constant s is equal to the negative reciprocal of
N_0 taken modulo R (that is, -1/N_0 mod R).

From a mathematical point of view, the present applicants have employed an algorithm
for which the input variables are the two numbers being multiplied, namely, A and B, the modulo
number N, the constant s associated with N, and the parameters m, k and R = 2^k. The output of
the function provided by the present invention, Z, is given by Z = f(A, B) = AB 2^{-mk} mod N. The
procedure specified by applicants' method initializes the value Z_0 to be zero and, for the integer i
ranging from 0 to m-1, calculations are carried out to produce X_i, y_i and Z_{i+1}. The values for
X_i and y_i are computed during a first operational phase of each one of m cycles. The value Z_{i+1} is
computed during a second phase of each cycle. The adders and multipliers used to calculate X_i
are "time shared" to also carry out the calculation needed to produce Z_{i+1}. In particular, at each
stage i, X_i is given by Z_i + A_i B. At this stage, the value of y_i is also computed as the constant s
times the least significant k bits of X_i modulo R. If one represents the least significant k bits of
X_i as x_{i,0}, then y_i = s x_{i,0} mod R. This completes the operations that are carried out in a first phase
(X-phase) during one of the cycles of the present process. In the second phase (Z-phase), an
updated value for the Z register (50 in Figure 1) is computed as (X_i + y_i N)/R. At the last stage of
processing, the desired value of Z is present in the Z register. In particular, at this stage, Z_m = AB
2^{-mk} mod N. At each stage (cycle), values for X_i, y_i and Z_i are stored for purposes of computation
in subsequent steps.

It is noted that if both input variables A and B have n+1 bits, the output of the function
provided by the present invention, Z = f(A, B) = AB 2^{-mk} mod N, for N being an n-bit odd number,
has no more than n+1 significant bits. That is, the output is less than 2^{n+1}. The hardware circuit
described herein takes as inputs A and B of n+1 bits each and generates as output Z of n+1 bits.
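
The stated bound on the output can be checked with a short derivation. The following is a sketch under the assumptions A, B < 2^{n+1}, N < 2^n and mk ≥ n + 2; the symbol Y for the accumulated multipliers y_i is introduced here for convenience and does not appear elsewhere in this description.

```latex
% Unroll R Z_{i+1} = Z_i + A_i B + y_i N for i = 0, ..., m-1 with Z_0 = 0:
\begin{align*}
R^{m} Z_m &= \sum_{i=0}^{m-1} R^{i}\,(A_i B + y_i N)
           = AB + N\,Y, \qquad Y = \sum_{i=0}^{m-1} y_i R^{i} < R^{m}, \\
Z_m &= \frac{AB + N\,Y}{R^{m}}
     < \frac{AB}{2^{mk}} + N
     < \frac{2^{2n+2}}{2^{n+2}} + 2^{n}
     = 2^{n+1}.
\end{align*}
```

Under these assumptions the output therefore fits in n+1 bits, so it can be fed back as an input to a subsequent multiplication without an intermediate reduction.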

A hardware circuit for carrying out this process is illustrated in Figure 1. In particular,
the factor A of n+1 bits, which is the factor which is treated as being in partitioned form, is
stored in A register (10). Multiplexor 11 supplies sequential blocks of k bits from register 10 to
multiplexor 31, with k = 32 bits for illustration. Multiplexors 31, 21, and 52 operate in
conjunction with one another, selecting one of two possible input values depending upon whether
the circuit is operating in the X-phase or the Z-phase. Accordingly, during the first phase
of its operation, multiplexor 11 provides the k bits in A_0. In the first phase of the second cycle,
the next k bits A_1 in A are supplied via multiplexor 11. A sub-block of k bits from A is provided
during the initial or X-phase portion of each cycle. In the third cycle, multiplexor 11, therefore,
provides the next k bits in A, namely, the bits denoted above and herein as A_2. Accordingly,
multiplexor 11 is seen to operate selectively as a function of the cycle number (namely, cycles 0
through m-1).

During the X-phase of each cycle, the value A_i is selected from the A Register (10) via
multiplexor 11 and correspondingly multiplexor 21 selects the contents of the B Register (20).
Thus, in accordance with the present invention, the numbers to be multiplied are stored in
registers 10 and 20. It does not matter which number is stored in which register. It is also noted
that, whether or not the circuit is operating in the initial X-phase or in the final Z-phase in each
cycle, multiplexors 31 and 21 supply k bits and n+1 bits, respectively, to multiplier array 70 in
each phase. It is thus seen that, during the X-phase, multiplexors 31 and 21 select contents from
the B register and part of the A register. It is also noted that, in typical situations, the value of n
is often around 512 or more and the value of k is approximately 32. Accordingly, it is seen that
multiplier array 70 strikes a balance between 1 bit x n bit multiplication and full n bit x n bit
multiplication. It is also noted that increases in the value of n are almost always, in practice, an
increase by a factor of at least a power of 2.

As with any binary multiplier with inputs that are n+1 bits wide and k bits wide
respectively, multiplier array 70 produces an output which is n+1+k bits wide. The lower order k
bits from multiplier array 70 are supplied to adder 65 which is designed to add two k bit addends
at a time. In this regard, it is noted that adder 65 is present in the circuit for computing y_i. As
such, and given that the value of y_i is dependent upon the last k bits of the value X_i which is a
sum which has not yet been fully computed, it is necessary to perform this addition which is
essentially the addition for the low order k bits of X_i. The first addend comes from the rightmost
k bits in the Z register as selected by multiplexor 52. These bits are added to the k bits in the
rightmost portion of the product A_i B. The output of adder 65 is x_(i,0) which is the least significant
k bits of X_i = Z_i + A_i B. This output is stored in register 55 and is also supplied to multiplier 80
which multiplies two k bit numbers together. This is not, however, a multiplication modulo N.
The other factor supplied to multiplier 80 is the number s from the s register (60). Since this
result is required modulo R, only the rightmost k bits from multiplier 80 are supplied back to the
y register (30) in this X-phase. The value stored in this register is used during the calculation
carried out in the Z-phase as discussed below.
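
The reason y_i can be formed before the full sum X_i is available may be illustrated with the
following minimal sketch. It is offered only as an illustration: the function name
y_from_low_bits is hypothetical, and the comments map each line to adder 65 and multiplier 80
only loosely.

```python
def y_from_low_bits(Z, A_i, B, s, k):
    """Sketch: compute y_i from only the low-order k bits of Z_i and of A_i*B."""
    mask = (1 << k) - 1
    # Role of adder 65: add two k bit addends and keep k bits (carry-out ignored here).
    x_low = ((Z & mask) + ((A_i * B) & mask)) & mask
    # Role of multiplier 80: k bit by k bit product, of which only the rightmost k bits
    # are kept, which is the reduction modulo R = 2^k.
    return (s * x_low) & mask


if __name__ == "__main__":
    # Second cycle of the k = 3 example: Z_1 = 53, A_1 = 2, B = 70, s = 5.
    print(y_from_low_bits(53, 2, 70, 5, 3))
```

The value returned agrees with s times the low k bits of the complete sum Z_i + A_i B, since the
low k bits of a sum depend only on the low k bits of its addends.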

The rest of the X-phase calculation is devoted to calculation of the high order n+1 bits of
the sum Z_i + A_i B. Multiplier 70 is configured as a circuit for multiplying together the bits from
the B Register (20) and a sequence of m blocks of k bits each from selected k bit blocks A_i from
the A register. Multiplication of two k bit numbers generally produces a number having 2k bits
and, in particular, this is the situation with respect to applicants' multiplier 80. However, it is
noted that the calculation of y_i is computed modulo R. The modulo requirement of the
computation is easily accomplished simply by returning only the rightmost k bits from the output
of multiplier 80 to the input of the y Register (30).

As pointed out above, multiplication of numbers generally produces outputs having bit
lengths greater than either of the two input number bit lengths. In particular, with respect to
multiplier 70, the output is considered to be n+1+k bits in length. The low order (rightmost) k
bit output is supplied from multiplier 70 to adder 65. However, each k bit block multiplication
carried out in multiplier array 70 produces 2k bits formed as a k bit "result" and a k bit "carry"
into the next position. The summation to produce the desired intermediate output A_i B is carried
out in adder 75 which adds together two portions, the first portion which is n+1 bits long and the
second portion which is only n+1-k bits long. The n+1-k bits represent the "carry" portion of
the multiplication. Accordingly, the output of adder array 75 is the result of the high order n+1
bits of A_i B. This result is supplied directly to adder array 85 which adds to it a shifted value of Z_i
from Z register 50. And appropriately, this high order n+1 bits of X_i = Z_i + A_i B is stored in Z
register 50 in preparation for the Z-phase calculation. The low order k bits of X_i are stored in
register 55 as described above.

In the Z-phase of an operation cycle, multiplier array 70 and adders 75 and 85 are again
employed except that now the inputs to multiplier array 70 are the contents of the y Register (30)
as selected by multiplexor 31. The other factor supplied to multiplier array 70 is the contents of
the N register (40) which is selected during the Z-phase of an operation cycle by means of
multiplexor 21. As before, multiplier array 70 computes the product of an n+1 bit number and a
k bit number. Adder array 75 performs the natural addition operation associated with
multiplication in which there is an effective carry-like operation from one k bit subfield to the
next k bit subfield. Accordingly, the output of adder array 75 during the Z-phase of operation is
the high order n+1 bits of the product y_i N. The addition of y_i N and the value X_i together with its
division by R in the present method is accomplished by discarding the low order k bits from the
output of adder 65 and storing only the high order bits from adder 85 to register 50.
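
The division by R in the Z-phase loses no information because y_i is chosen so that the low order
k bits of X_i + y_i N are all zero. The following minimal sketch, using the hypothetical function
name z_phase, checks that property explicitly before discarding those bits; it is an illustration
under that assumption and not a description of the adder wiring.

```python
def z_phase(X, y, N, k):
    """Sketch of the Z-phase update Z_(i+1) = (X_i + y_i*N) / R with R = 2^k."""
    mask = (1 << k) - 1
    total = X + y * N
    if total & mask:
        # With y = s * (X mod R) mod R and s = -1/N_0 mod R this should not occur.
        raise ValueError("low-order k bits are not zero; y is inconsistent with X and N")
    # Dividing by R is a right shift by k bits, i.e. discarding the low-order block.
    return total >> k


if __name__ == "__main__":
    # First and second cycles of the k = 3 example in this description.
    print(z_phase(210, 2, 107, 3), z_phase(193, 5, 107, 3))
```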

The differences in the X-phases and Z-phases of operation are more fully appreciated
from an inspection of the differences between Figures 2 and 3. In particular, Figure 2 illustrates
the active data flow paths that are present in the first or X-phase of each operational cycle.
Likewise, Figure 3 illustrates the data flow paths which are active during the second or Z-phase
of each operational cycle. The calculations that are carried out in the X-phases and Z-phases are
repeated a total of m times with the final result Z_m being one of the desired results at the end of m
cycles of operation with each cycle including an X-phase and a Z-phase. At this stage of
operation, the value present in Z register 50 is AB 2^(-mk) mod N.

The circuit illustrated in Figures 1-3 is also capable of producing the multiplicative result
AB mod N. This is accomplished by first using the circuit shown to compute AB 2^(-mk) mod N
and then by using the circuit again with either the A or B register being provided with the output
from the first operational stage and multiplying this value by 2^(2mk) mod N. Since each operation
of the circuit (through m cycles) introduces the factor of 2^(-mk), the multiplication by 2^(2mk) cancels
the first factor 2^(-mk) introduced during the first stage of operation of the circuit and also cancels
the other factor of 2^(-mk) introduced during the second multiplicative stage of operation. Thus,
using two passes (two stages) with m cycles each through the circuit of Figures 1-3, the result AB
mod N is computed. For purposes of clarity and ease of understanding and description as used
herein, an operational stage of the process of the present invention refers to m cycles of circuit
operation following the loading of the factors into the A and B registers.
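
The two-stage use of the circuit can be sketched as follows. This is a minimal illustration only:
mont_mul and mod_mul_two_pass are hypothetical names, Python integers stand in for the
registers, and the constant 2^(2mk) mod N is simply computed with the built-in pow rather than
being supplied from outside as it presumably would be in hardware.

```python
def mont_mul(A, B, N, k, m):
    """One operational stage: returns a value congruent to A*B*2^(-m*k) mod N."""
    R = 1 << k
    mask = R - 1
    s = (-pow(N & mask, -1, R)) % R   # s = -1/N_0 mod R (Python 3.8+)
    Z = 0
    for i in range(m):
        X = Z + ((A >> (i * k)) & mask) * B   # X-phase
        y = (s * (X & mask)) & mask
        Z = (X + y * N) >> k                  # Z-phase
    return Z


def mod_mul_two_pass(A, B, N, k, m):
    """Two stages: the factor 2^(2mk) cancels the two factors of 2^(-mk)."""
    correction = pow(2, 2 * m * k, N)          # 2^(2mk) mod N
    first = mont_mul(A, B, N, k, m)            # A*B*2^(-mk) mod N
    return mont_mul(first, correction, N, k, m)


if __name__ == "__main__":
    # Example values from this description; the result is congruent to A*B mod N.
    print(mod_mul_two_pass(83, 70, 107, 3, 3) % 107)
```

The final reduction with % 107 in the usage lines is included only because the sketch, like the
circuit, may return a value that is congruent to AB mod N without being the least residue.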

The operation of the above circuit is perhaps more easily understood by means of the
following example in which k = 3, R = 2^3, N = 107 = R^2 + 5R + 3 = (1, 5, 3) = (N_2, N_1, N_0), N_0
= 3, m = 3, s = -1/N_0 mod R = 5, A = 83 = R^2 + 2R + 3 = (1, 2, 3), B = 70 = R^2 + 0R + 6 =
(1, 0, 6). Decimal digits are employed here merely for the sake of example and for an easier
understanding of the process. For a more detailed illustration, the decimal numbers may be
represented as blocks containing 3 bits each. The process carried out by the circuit disclosed
above occurs in three steps as follows (i = 0, i = 1, and i = 2):



Step 1.

    X_0 = Z_0 + A_0 B = (3, 2, 2),  y_0 = 2s mod R = 2
    y_0 N = (2, 10, 6) = (3, 2, 6)
    X_0 + y_0 N = (6, 5, 0)
    Z_1 = (X_0 + y_0 N)/R = (0, 6, 5)

Step 2.

    A_1 B = (2, 0, 12) = (2, 1, 4)
    X_1 = Z_1 + A_1 B = (3, 0, 1),  y_1 = s = 5
    y_1 N = (5, 25, 15) = (1, 0, 2, 7)
    X_1 + y_1 N = (1, 3, 3, 0)
    Z_2 = (1, 3, 3)

Step 3.

    A_2 B = (1, 0, 6)
    X_2 = Z_2 + A_2 B = (2, 4, 1),  y_2 = s = 5
    y_2 N = (5, 25, 15) = (1, 0, 2, 7)
    X_2 + y_2 N = (1, 2, 7, 0)
    Z_3 = (1, 2, 7) = 87

87 x R^3 = A x B mod N = 32.
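
The digit vectors in the example above are base R = 8 representations written most significant
digit first, and some of them, such as (2, 10, 6), are shown before carries are propagated. The
following minimal sketch, with the hypothetical helper names from_digits and to_digits, converts
between the two forms so that the individual lines of the example can be checked.

```python
def from_digits(digits, k):
    """Value of a digit vector in base R = 2^k, most significant digit first.

    Digits may exceed R - 1, as in the unnormalized vectors of the example.
    """
    value = 0
    for d in digits:
        value = (value << k) + d
    return value


def to_digits(value, k, count):
    """Normalized base-2^k digit vector of the given length, most significant first."""
    mask = (1 << k) - 1
    return tuple((value >> (j * k)) & mask for j in reversed(range(count)))


if __name__ == "__main__":
    print(to_digits(from_digits((2, 10, 6), 3), 3, 3))    # y_0 N after carry propagation
    print(to_digits(from_digits((5, 25, 15), 3), 3, 4))   # y_1 N after carry propagation
```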



Although it is the objective to compute AB mod N where A, B and N are all n bits long, for
convenience, the process herein employs A, B, and Z registers that are n+1 bits or mk bits long. 
This avoids the necessity for checking the final and intermediate results to determine whether or 
not they are in fact greater than N. This aspect, for example, shows up in Step 2 in the example 
provided above. 

The present inventors have also recognized that, at least partly due to the typically large
difference between the size of n and k, there is a certain disparity in processing that occurs in the
construction of an n by k multiplier. Accordingly, it is possible to partition the calculation
carried out in the circuit shown in Figures 1-3. In particular, the circuit shown in Figure 1 is in
fact constructable in the form of a plurality, d + 1, of processor elements (PE) which are connected
together in a chained or cascaded fashion. Each of the processing elements is constructed in the
same way. However, the processing element for the rightmost portion of the data, herein referred
to as PE_0, has a somewhat more complicated structure, as shown in Figure 4. A simpler circuit is
employed for processing elements 1 through d. However, in preferred embodiments, the leftmost
or last processor element PE_d can in fact be constructed much more simply as shown in Figure 6.
Accordingly, Figure 4 shows a structure for a processing element circuit for the rightmost portion
of the data. Figure 5 illustrates a circuit for a processing element which is usable in a repeated
fashion which utilizes as many individual processing elements as necessary, thus illustrating
the scalability aspects of the present invention. Lastly, Figure 6 illustrates a preferred, simplified
embodiment for the last or leftmost processing element.

For purposes of understanding and appreciating the present invention, the registers R_0
through R_d, as illustrated in Figures 4, 5, and 6, are not considered as a part of the processing
elements per se but rather are best understood as part of a separate, partitioned register structure.
It is these registers that contain the desired results of the modulo N multiplication operation.
These registers thus serve the same function as the Z register in Figure 1.

With specific reference to Figure 4, it is seen that multiplexor 193 operates during the
X-phase to supply a 2k bit augend to adder 185. During the first or X-phase of operation,
multiplexor 193 supplies a 2k bit number which has leftmost bits from register R_2 (reference
numeral 192) and rightmost bits from register R_1 (reference numeral 191). During the second or
Z-phase of operation, multiplexor 193 supplies a different 2k bits of data to adder 185. In
particular, during the Z-phase multiplexor 193 supplies as its leftmost k bits the contents of
register R_1, and as its rightmost k bits the contents of register R_0 (reference numeral 190).

In contrast to the full-width registers 10, 20, 40, and 50 in Figure 1, the corresponding
registers in a partitioned system have fewer bits. In particular, the corresponding B and N
variable registers in a general processing element PE preferably employ a width equal to 2k bits.
However, for the rightmost processing element, a larger number of bits is desired. In particular,
in the case in which n equals 512, registers 120 and 140 in Figure 4 preferably have a width of 96
bits. Multiplexor 121 selects the contents of register B (reference numeral 120) during the
X-phase of computation and likewise selects the contents of register N (reference numeral 140)
during the Z-phase of computation. In general, the overall n-bit wide series of computations is
broken down into partitions of any convenient size. It is not even necessary that all of the
processor elements are the same size or process the same data width. However, for
convenience of circuit design and circuit layout, it is preferable that each of the individual
processing elements (except for the rightmost element, PE_0) have the same data processing
capability in terms of data width. Therefore, in general, for purposes of consideration and
discussion herein, it is assumed that there are a total of d + 1 processing elements labeled from
PE_0 through PE_d. Processing element PE_0 preferably has a structure such as that shown in Figure
4. PE_d has the preferred structure illustrated in Figure 6, although it is noted that a more generic
structure, such as that shown in Figure 5, may be employed for the leftmost processor element
PE_d though it is not necessary that this leftmost processing element be any more complicated than
that shown in Figure 6.

Also, for purposes of convenience of circuit design, layout, and packaging efficiency, it is
generally desirable that the data width, W, of each processing element be an integer multiple of k.
In the designs presented herein for a value of n = 512, processor elements PE_1 through PE_(d-1)
each process data in 2k bit wide chunks. Thus, in this example, W = 2k, where W is the width of
the data in each of the typical or generic forms of processing element, as illustrated in Figures 5
and 5A. It is noted that processor element PE_0 as shown in Figure 4 possesses an extra k bit
processing capability, as is more particularly described below. Thus, if each typical processing
element PE_i processes data in W bit wide chunks and if there are d + 1 processing elements with
the rightmost processing element processing an extra k bits, then it is the preferred case that n =
Wd + k. Thus, in general, the output of multiplexor 121 preferably comprises W + k bits. The
leftmost third of these bits are supplied to multiplier 173, the middle third of the bits in register
BN (reference numeral 198) are supplied to multiplier 172, and the rightmost third of the bits are
supplied to multiplier 171. Multipliers 171, 172, and 173 are thus each k bit by k bit multipliers.
In this regard, it is noted that the original relatively large multiplier array 70 in Figure 1 employs
an n by k multiplier. However, it is noted that the partitioning of the computation into a system
employing a plurality of nearly identical processing elements results in the construction of
circuits which now utilize multipliers which operate much more quickly since each multiplier
now is typically only k bits by k bits. And clearly, since k is typically much less than n,
processing takes place significantly faster.

The leftmost k of the 2k bits output from multiplier 173 are supplied as a partial product out
(PPO) to the next unit in the chain. In particular, it should be appreciated that in the discussions
herein, the natural order of processing is from the rightmost on through to the leftmost
processing element in the chain (see Figure 7). Thus, data is passed from one processing element
to the processing element on its immediate left. However, it should be noted that left and right
are relative terms useful essentially only for descriptive and understanding purposes. The
rightmost k bits from multiplier 173 are supplied as the leftmost k bits of a 2k bit augend supplied
to adder 175. The rightmost k bits of this 2k bit augend are supplied from the lower or rightmost
k bits of multiplier 172. Thus, the rightmost k bits of multipliers 173 and 172, respectively, are
combined, as shown in Figure 4, to supply a 2k bit wide augend to adder 175. Adder 175 also
has as its other input a 2k bit augend which is supplied from the leftmost k bits of multipliers 172
and 171, respectively, with multiplier 172 supplying the leftmost k bits of the 2k bit augend and
with multiplier 171 supplying the rightmost k bits of the 2k bit augend supplied to adder 175.
Thus, adder 175 is a 2k bit wide adder. An equivalent but alternate connection arrangement is
shown in Figure 4A.
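
The way in which several k bit by k bit multipliers together form a wider product can be sketched
as follows. This is only an arithmetic illustration under stated assumptions: the function name
partitioned_multiply is hypothetical, the number of blocks is fixed at three as in PE_0, and the
carry signals exchanged between adders and processing elements are folded into ordinary integer
addition rather than modelled individually.

```python
def partitioned_multiply(b, a, k, parts=3):
    """Sketch: multiply a (parts*k) bit value b by a k bit value a using k x k products.

    Each block product is 2k bits wide and is split into a k bit "result" (low half)
    and a k bit "carry" (high half) that belongs one block position to the left.
    """
    mask = (1 << k) - 1
    total = 0
    for j in range(parts):
        block = (b >> (j * k)) & mask
        product = block * a                  # role of multipliers 171, 172 and 173
        low, high = product & mask, product >> k
        total += low << (j * k)              # "result" stays in position j
        total += high << ((j + 1) * k)       # "carry" moves into position j + 1
    return total


if __name__ == "__main__":
    b = 0x123456789ABCDEF011223344           # 96 bit operand, three 32 bit blocks
    a = 0xFEDCBA98                            # 32 bit operand
    print(partitioned_multiply(b, a, 32) == b * a)
```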

Multiplexor 152 operates to select, during the X-phase of computation, k bits from
register R_0 (reference numeral 190). During the Z-phase, multiplexor 152 selects as its input the
contents of temporary register 150 containing the variable x_0. The output of multiplexor 152 is
supplied to adder 165 which is k bits in width. Adder 165 receives two augends, namely, the
rightmost k bits from multiplier 171 and the k bits supplied from multiplexor 152. The output of
adder 165 is stored in temporary register 150 and is also supplied to multiplier 180 which is also
a k bit by k bit multiplier. The other factor supplied to multiplier 180 is the contents of register
160 which contains the variable s. (The calculation of s as -1/N_0 mod R is efficiently carried out
in the circuit shown in Figure 17 which is discussed in detail below.) The output of multiplier
180 is supplied to register 130 which thus contains the value y as defined by the algorithm set out
above.
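
For readers wishing to experiment in software, the constant s = -1/N_0 mod R can be obtained in
several ways. The following minimal sketch uses Newton (Hensel) iteration, which is a standard
technique chosen here purely for illustration; it is an assumption of this sketch and is not
represented to be the method carried out by the circuit of Figure 17. The function name
neg_inverse_mod_r is hypothetical.

```python
def neg_inverse_mod_r(n0, k):
    """Sketch: return s with s * n0 = -1 (mod 2^k), for odd n0."""
    if n0 % 2 == 0:
        raise ValueError("n0 must be odd for an inverse modulo 2^k to exist")
    R = 1 << k
    x = 1            # n0 * x = 1 (mod 2) holds for any odd n0
    bits = 1         # number of low-order bits of the inverse known to be correct
    while bits < k:
        # Newton step: doubles the number of correct low-order bits.
        x = (x * (2 - n0 * x)) % R
        bits *= 2
    return (-x) % R


if __name__ == "__main__":
    # N_0 = 3 and k = 3 as in the example of this description.
    print(neg_inverse_mod_r(3, 3))
```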

The output of register 130 is supplied to multiplexor 131 and is also supplied to the next
processing element PE_1 (see Figure 5). Multiplexor 131 operates to select a portion of the
variable A which is one of the factors in the multiplication operation. (Other k bit wide portions
of variable A are selected by their respective processing elements.) In particular, register 110
contains the rightmost k bits of the variable A. Thus, during the X-phase of operation,
multiplexor 131 operates to select the contents of register 110 to be supplied to multipliers 173,
172, and 171, as shown. Likewise, during the Z-phase of computation, multiplexor 131 operates
to select the variable y from register 130 to be supplied to this same set of multipliers as the other
factor.

A carry-out signal line from adder 165 is also supplied as a carry input to the lowest order
position in adder 185, as shown. Additionally, adder 175 supplies a first carry-out signal line to
the next processing element in the chain; similarly, adder 185 also supplies a second carry-out
signal line to the next processing element in the chain. In particular, since Figure 4 illustrates
processing element PE_0, carry-out signal line 1 and carry-out signal line 2 are both provided to
processing element PE_1. The connections between PE_0 and PE_1 are readily apparent simply by
placing Figure 4 to the right of Figure 5. In particular, processing element PE_0 supplies the
variable y, the partial product out, and the two carry-out signal lines to the inputs shown in PE_1 of
Figure 5. In particular, it is also noted that the variable y (that is, the contents of register 130) is
supplied to each one of the individual processing elements. And lastly, with respect to Figure 4,
it is noted that the output of adder 185 is supplied to registers R_0 and R_1 shown at the top of
Figure 4. As indicated above, it is the register set (containing R_1 and R_0 on the right) which
ultimately contains the desired calculation result. Accordingly, reference numeral 100 in Figure
4 describing processing element PE_0 does not include this register set. This register set is
discussed separately below in terms of some of the other variations and structures that are
employed in the present invention.

Attention is now directed to a discussion of Figure 5 which illustrates a more typical 
processor element and, in particular, which illustrates the form of a processor element which may 
be repeated in a circuit/system chain which is as long as is required to process factors which are n 
bits wide. 

With specific reference to Figure 5, it is noted that it is similar to Figure 4 except that the
part of the processing dealing with k bit wide operations involving s and N_0 need not be present
in any processing elements except the rightmost one, namely, PE_0. In particular, Figure 5
indicates that the generic form of a processing element PE_p bearing reference numeral 200
specifically does include register BN (reference numeral 298) but does not include the other
registers shown. One of the significant differences between Figures 4 and 5 is that register 220
contains only a portion of the bits for the second factor B. In particular, register 220 contains 2k
bit wide chunks designated as B_(2p+2) and B_(2p+1), where p ranges from 1 to d - 1. Again, as above,
multiplexor 221 selects either the 2k bits from register 220 or the 2k bits from register 240 which
has corresponding portions (here 2k bit chunks) of the variable N. Accordingly, register BN is
2k bits wide. Unlike register 198 in Figure 4, register 298 (BN) in Figure 5 is only 2k bits wide.
In one preferred embodiment of the present invention when n = 512, register BN is 64 bits wide.

From an overall perspective, general processing element PE_i (reference numeral 200 as
shown in Figure 5) accepts, as input from the right, the value of y, the partial product in, carry-in
1 and carry-in 2. Processor element PE_i also has as an input the corresponding portion of the k
bits of the multiplier factor A from register 210. The register involvement for registers 292, 291,
and 290 is substantially as shown in Figure 4 except now shown in the generic version of a
processor element. It is these registers that store intermediate values between phases and
ultimately store the completed product, AB mod N. Also, from an overall perspective, processor
element 200 produces, as an output, a k bit partial product out which is supplied to the processor
element on its left together with carry-out signals 1 and 2 which are supplied to the corresponding
adders 275 and 285 in the processor element on the left. The output of adder 285 is supplied
back to registers R_(2p+1) and R_(2p). Accordingly, other than the connections to the register sets for R,
N, and A, the processing elements are connected simply by matching partial products in and
out and carries in and out 1 and 2. Accordingly, in systems constructed in accordance with those
aspects of the present invention which employ a plurality of similar processing units, the overall
system is constructed by starting with the circuit shown in Figures 4 or 4A as a rightmost
position and placing, in adjacent positions, processing elements similar to those shown in Figures
5 or 5A. The overall configuration, therefore, is seen in Figure 7.

However, before proceeding, it is useful to consider the fact that the leftmost processor
element PE_d does not have to be as complicated as the processing elements to its right such as
those shown in Figures 5 or 5A. In particular, the leftmost processing element only needs to
process k bits. In the X-phase of operation, the circuit shown in Figure 6 acts to add carry-in 1 to
the partial product input to the leftmost processing element via increment-carry circuit 375.
Likewise, adder 385 adds carry-in 2 to the other input to adder 385 to produce an output which is
supplied to register R_2d in the immediately preceding processor element. In the Z-phase of
operation as controlled by AND-gate 399, the contents of register R2i (reference numeral 390)
are added to the output of increment-carry circuit 375 and this is also supplied to register R2i in
the feedback configuration as shown. Accordingly, it is seen that in partitioned embodiments of
the present invention, it is preferable to employ a leftmost processing element which is simpler
than that which is generally required in one of the generic processing elements between the
rightmost and leftmost elements. However, while preferable, this substitution is not mandatory.

The partitioning of the computational problem as provided in one embodiment of the 
present invention into a solution having a plurality of nearly identical processing elements 
provides significant advantages in terms of design, efficiency, layout, and structure. 
Concomitantly, these advantages also lead to advantages in circuit speed and throughput.
However, it is also very important to note that the partitioning into a plurality of processing
elements also provides significant advantages in terms of the fact that a pipelined operation is
now possible. In particular, while pipelined operations generally introduce a small initial delay,
the total throughput, as measured in terms of modulo multiplications per time unit, is
significantly improved. Accordingly, a significant portion of the description below is devoted to
a discussion of the use of the described partitioned processing element structure in conjunction
with a pipelined method for operating the circuits shown in Figures 4, 5, and 6, and variations
thereof.

However, before embarking on a discussion regarding the pipelining aspects of the
present invention, it is also useful to note that the circuits shown in Figures 4-7 are perfectly
capable of operation in a non-pipelined fashion. Such a mode of operation is illustrated in Figure
8. In particular, it is noted that Figure 8 is a logical time-sequence diagram illustrating the use of
the register set R_0 through R_33 as a final and temporary storage medium for passing information
between the X-phase of computation and the Z-phase of computation. Figure 8 also more
particularly illustrates the distinction pointed out above between the register set and the
individual processing elements. This figure also illustrates the unique positions for the rightmost
and leftmost processing elements wherein the rightmost element is supplied with information
from three registers and wherein the leftmost processing element receives direct information only
from the leftmost portion of the register set, namely, R_33 since, in this particular case, n is
assumed to be 1,024 and k is assumed to be 32. Not shown in Figure 8 are the signal connections
between the processing elements. Rather, Figure 8 is meant to be illustrative of time sequencing
and the utilization of the register set. In particular, it should also be noted that, in Figure 8, the
processor elements in the upper half of the illustration are all operating in the X-phase at the
same time, and likewise, all of the processing elements in the lower portion are operating in the
Z-phase. Variations of this operational modality are more particularly described below with
respect to Figure 9 and considerations relating to pipelining of the information into and out of the
circuit. In the case of no pipelining, such as shown in Figure 8, all of the processing elements
start to process data at the same time and finish at the same time. In any given clock cycle, all of
the processing elements are either all in the X-phase or are all in the Z-phase of calculation. In
this mode, each processing element updates a fixed slice of the complete partial result register
(two R_i registers). Since all of the partial product registers are updated at the same time,
everything works smoothly in accordance with the algorithm described above.

Attention is now directed to that aspect of the present invention in which the processing
elements are operated in a pipelined fashion. In order to achieve this result, certain hardware 
modifications are made to the circuits shown in Figures 4 and 5. These modifications are more 
particularly illustrated in Figures 10 and 11, respectively, to be discussed more particularly 
below. 

However, for purposes of better understanding the utilization of the processing elements
in a pipelined fashion, attention is specifically directed to Figure 9. In the pipelined approach, it
is the case that, in a given clock cycle, any two adjacent processing elements are always in
different phases with the processing element processing the less significant slice of data always
being one clock cycle ahead. As seen by the circular arrows in Figure 9, it is unfortunately the
case that, while a given processing element is in the X-phase, it requires, as input, a 32-bit value
from the Z-phase that is being calculated at the same time by the next processing element in the
chain that is still in the previous Z-phase. For example, as shown in Figure 8, the rightmost
processing element PE_0 on the top right is in the X-phase. This requires, as an input, the value in
R_2 from processing element PE_1 which is one clock cycle behind in the Z-phase. This problem is
solved by adding a feedback path from the next processing element in the chain, which links to
a k-bit adder (see reference numeral 235 in Figure 10 and reference numeral 135 in Figure 11).
This solution creates additional delay due to the presence of a new k-bit adder. However, the
maximum working frequency is not significantly affected since a k-bit adder is a relatively fast
circuit. Additionally, it is noted that the previous signal path, before this change, was not a
critical path. The original critical path occurred in the rightmost processing element PE_0 due to
the calculation of the constant y. The advantage to this particular solution is that there is no need
to modify the formulas in the algorithm; however, on the other hand, the maximum frequency is
nonetheless slightly affected. Additional variations, to be considered more particularly below,
consider this minor problem and provide yet another solution which eliminates the delay
introduced by adders 235 and 135. In any event, either of the two pipelining solutions presented is
an improved solution over that provided by the purely parallel approach illustrated in Figure 8.
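
The difference in timing between the non-pipelined and pipelined modes can be visualized with
the following minimal sketch. It models only which phase each processing element is in at each
clock cycle, under the assumption (made for this sketch) that processing element p starts p clock
cycles after PE_0 in the pipelined case; the names phase and schedule are hypothetical and no data
paths, latches or feedback adders are modelled.

```python
def phase(pe, clock, pipelined=True):
    """Phase of processing element `pe` at a given clock cycle: "X", "Z" or "idle"."""
    start = pe if pipelined else 0      # assumed start offset of one clock per element
    local = clock - start
    if local < 0:
        return "idle"                   # element has not yet received its first data
    return "X" if local % 2 == 0 else "Z"


def schedule(num_pes, num_clocks, pipelined=True):
    """Rows are clock cycles; columns are PE_0 (rightmost) through PE_(num_pes-1)."""
    return [[phase(p, t, pipelined) for p in range(num_pes)] for t in range(num_clocks)]


if __name__ == "__main__":
    for row in schedule(4, 6, pipelined=True):
        print(" ".join(f"{entry:>4}" for entry in row))
```

In the non-pipelined schedule every entry in a row is the same phase, whereas in the pipelined
schedule adjacent entries in a row differ once both elements have started.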

As pointed out above, Figure 10 is similar to Figure 5, but more particularly illustrates the
inclusion of extra hardware elements that are used to achieve smooth operation in a pipelined
fashion. In particular, latches 232, 233, and 234 are added as temporary storage mechanisms
between processor elements for holding the k bit wide partial products out (PPO), and the single
bit carry-out lines 1 (from adder 275) and 2 (from adder 285). Additionally, it is noted that latch
231 stores either the selected k bit wide portion of multiplier factor A_i or the constant y. This is
provided in an alternating fashion from multiplexor 131 (as shown in Figure 11). Additionally, it
is noted that the lower k bits from the output of adder 285 are supplied to the adjacent adder 235
which is actually present in the preceding processing element, namely the one to the right. In a
similar fashion, the lower k bits from the next (that is, the left) processing element are supplied to
adder 235. Additionally, there is a feedback connection (not shown for reasons of drawing
congestion) from the output of adder 235 to the corresponding segment of the register "set,"
namely, to R_(2p+1).

Similar changes in the circuit are made to the rightmost processing element PE0, as shown
in Figure 11. In particular, latches 131', 132, 133, and 134 are added to serve a function that is
the same as that provided by latches 231, 232, 233, and 234 in Figure 10. And as in Figure 10,
adder 135 is now included to incorporate the extra addition step for pipelined operations. It is
also noted that latch 131' in Figure 11 is supplied from multiplexor 131. It is from this latch that
values of Ai and y are supplied to subsequent processing elements in the chain. In this regard, it
is also noted that register 110 containing the value Ai is illustrated in Figure 11 as a k bit register,
while in fact the preferred embodiment is the one illustrated in Figure 1 in which a long A
register with n + 1 bits provides information to a multiplexor which selects subsequent k bit wide
chunks from the contents of the A register. Accordingly, register 110 in Figure 11 is preferably
constructed as illustrated from register 10 and multiplexor 11 in Figure 1. The simplification
shown in Figure 11 is only for clarity and for ease of understanding. Also, as is seen in the
corresponding portion of Figure 4, the output of multiplexor 121 is preferably W + k bits wide
where W is the width of the data chunks processed by each of the generic processing elements.

Before proceeding to a discussion of yet another preferred embodiment of the present
invention, it is worthwhile to consider the development described so far in order to provide
some overall perspective. In particular, a first preferred embodiment of the present invention
provides a circuit such as that shown in Figure 1 which employs relatively large multiplier and
adder arrays. In a second preferred embodiment, the adder and multiplier arrays are partitioned
so as to be deployed in a chained sequence of individual processing elements with each one
possessing the same structure and passing information from the rightmost to the leftmost
processing elements in a system which efficiently carries out the same operations as shown in
Figure 1. In a third preferred embodiment of the present invention, the processing elements are
further provided with an additional adder and latches which enable the processing elements to be
operated in a pipelined fashion, such as illustrated in Figure 9. In the next preferred embodiment
of the present invention, which is now considered in detail below, additional adders 135 and 235
are repositioned in the circuit so as not to negatively impact critical dataflow paths. It is this
embodiment which is now described. In particular, in this embodiment, the processing elements and
register sets are configured as shown in Figure 12. It is noted that, in Figure 12, the
register connections to the individual processing elements are in fact different. This difference is
due to the repositioning of the adder.

In particular, Figure 13 illustrates the repositioning of adder 135 from Figure 11 and,
likewise, Figure 14 illustrates the repositioning of adder 235 from Figure 10 to the position
shown as adder 435' in Figure 14. Accordingly, the design illustrated in Figures 10 and
11 for pipelined operations is improved even further by moving the indicated adder to the input
stage of the processing elements, which is facilitated by eliminating certain feedback paths
between the processing elements, as shown. The adder is moved from the output of the
processing element to the partial product input (R register path) and works in parallel with the
slower multiplier function blocks. This eliminates an adder from a critical path. From Figure 9,
it can be seen that when processor element PEp is in the X-phase, it requires an input from both
register portions R2p+2 and R2p+1. The R2p+1 value is actually updated by the p-th processor
element during its previous clock cycle. The "problem" is that the value in R2p+2, which is
supposed to contain the value of Z2p+2, is updated in the same clock cycle by processor
element p + 1 (PEp+1). It is noted that during the X-phase, processor element PEp adds the value
Z2p+2 contained in R2p+2 to the upper k bits of its output and loads the result into R2p+1 (this is the
X2p+1 value). Given that the contents of register R2p+1 are used and updated exclusively by PEp,
one can proceed as follows: (1) during the X-phase, processor element PEp does not add the
value of R2p+2 to its output before loading R2p+1; and (2) during the Z-phase, PEp receives, as an
extra input, the value in register R2p+2 (which at this time has been updated by PEp+1 with Z2p+2)
and adds this immediately to the R2p+1 input before any further processing. The modifications
to the circuit shown in Figure 11, which are illustrated in the circuit of Figure 13, are designed to
accomplish these goals.

The consequence of step (1) recited in the previous paragraph is that at this point the
value generated by the processing elements during the X-phase is no longer the same as
described in the algorithm set forth above. In order to compensate for this difference, another
term is added during the Z-phase. The benefit of this change is an increase in the maximum
frequency of operation and a reduction in the power needed by the circuit. Additionally,
there are also advantages in terms of a reduced need for silicon area (that is, chip "real estate")
together with advantages in having a more uniform and repeatable circuit design. Accordingly,
Figure 12 illustrates the new flow of data between the R register "set" and the processing
elements. Likewise, Figures 13 and 14 illustrate the presence of additional circuitry to
accomplish the objectives stated above.

The specific changes to the rightmost processing element for the improved pipelining
version of the present invention are now specifically set forth. As above, a partial product out
from multiplier 173 is latched up into k-bit wide register 432. Additionally, the variable M from
multiplexor 131 is latched up into latch 437.

Repositioned adder 435 is an adder having a width of 2k bits. It also receives a carry
input signal (carry-in 3) and includes two input signal lines. A 2k bit wide signal comes from a
combination of the output from AND-gate 402, which is supplied from register R1 (reference
numeral 191), and the output of multiplexor 193. Register 191 also supplies multiplexor 193,
which has as its other input the k bit output signal from register R0 (reference numeral 190).
The output of multiplexor 193 is under the control of the "X/Z Select" signal line, which causes
the supply of either the output of register R1 or register R0 as the rightmost k bits for the right
input to adder 435. (Note though that adders and multipliers are symmetric with respect to the
use of left and right inputs since the desired operations are commutative.) The first (rightmost)
2k bit input to adder 435 is either (R1, R0) or (00 ... 0, R1) depending on the "X/Z Select" signal
being 1 or 0, respectively. The "X/Z Select" signal configures the circuits for X-phase or for
Z-phase operation. During the X-phase, adder 435 executes the following operation:
(00 ... 0, R1) + 0, which result is sent to adder 135. In comparison with Figure 11, it is seen that
adder circuit 185 in Figure 13 receives (R1, R0) but can also receive the additional signal input
(R2, 00 ... 0). The reason for this option is based on pipelining operations because in such a
mode the Processing Element (PE) on the left is always behind one clock cycle. For example,
since PE1 in Figure 11 is responsible for updating the R2 register with the Z value, this means
that during the X-phase PE0 needs the Z value stored in R2 in PE1, which is still generating it.
Thus, in Figure 11, adder 135 is used to transform the X value in R2 to the successive Z value.
However, in contrast, in Figure 13 the value in R2 is added later in the next phase (a Z-phase)
via adder 435, which is not in a critical path.

The signal "Select R2" is always "zero" while the signal "X/Z Select" controls the X and
Z phases during modular multiplication. The "Select R2" signal, when set to "one," provides the
capability of performing regular multiplication as opposed to modular multiplication as needed,
or as desired. For regular multiplication, the "X/Z Select" signal line is always "zero" while the
"Select R2" signal line is always "one."

The other input to adder 435 is a 2k bit wide signal whose rightmost k bits, driven by
AND-gate 401, are all zeros during a modular multiplication or equal to the register R2 value
during a standard multiplication, as determined by the signal "Select R2". The output of
AND-gate 401 is now connected to the lower k bits of the leftmost 2k bit input to adder 435. The
leftmost k bits of this second input come from register R2 (reference numeral 192) under the
control of the "X/Z Select" signal line, which controls AND-gate 403. AND-gate 403 is, like
multiplexor 193, also under control of the "X/Z Select" signal line, as shown. The
reconfiguration of the adder's input signals is necessitated by the repositioning of adder 135 to a
position which is not in a time-critical path.

The functioning of signal line "Select PE0" is now more particularly described. The
inclusion and functioning of this control line is not related to the repositioning of adder 435.
When signal line "Select PE0" is "one," the hardware in the processing element becomes
equivalent to the generic hardware processor element PEi (1 < i < d). When the "Select PE0"
signal line is set to "one," multiplexor 406 selects the "Previous P" input signal bus and provides it
to adder 175 (which is equivalent to adder 275 in PEi). The output of AND-gate 405 changes
from "zero" (in the case of PE0 functioning) to the value driven by the carry input signal line for
adder 175 (or 275 in PEi functioning). Multiplexor 404 selects the "Carry In 2" signal line and
provides it as a carry input to adder 185 (or 285 in PEi functioning). Accordingly, the "Select
PE0" signal line is used to "disable" the following devices so that the processing element operates
as a generic PEi rather than as PE0: multiplier 171, adder 165, multiplexor 152, multiplier 180,
register 150 and register 160.

There are two cases in which it is desired that the "Select PE0" signal line should be
driven into the "one" state. This means that the PE behaves specifically like a generic PEi, as
opposed to the rightmost PE0.

The first case is when the system is designed comprising two separate chains of
Processing Elements. For example, each of the two chains is made up of a concatenation of one
PE0 together with seven PEi's (that is, with eight Processing Elements per chain). These two
chains (with eight PE's each) are particularly useful in carrying out operations of modular
multiplication involving public key cryptography algorithms such as the RSA algorithm using the
Chinese Remainder Theorem (CRT). In such cases, each of the two chains operates
independently to perform two modular multiplications. In the case of modular multiplication as
described above, there is thus provided a command which effectuates this operation together with
an exponentiation function which is described in more detail below. In this case, the two chains
of Processing Elements are concatenated to form a longer chain that is thus able to process more
data in the same amount of time. In this case, the "PE0" on the rightmost position of the left
chain behaves as a PEi and receives the inputs from PE7 (here "7" is used as an example which is
in harmony with the exemplary chain size of eight, as recited above) from the right chain. This is
accomplished by setting the "Select PE0" signal to "one." These two chains may be represented
diagrammatically as follows:

PE7B PE6B . . . PE1B PE0B <--> PE7A PE6A . . . PE1A PE0A

In the event that the hardware herein is not being operated in the Chinese Remainder Theorem
mode (to be discussed in more detail below), PE0B acts as a PEi and its "Select PE0" signal input
line is set to "one." There is also one other input control signal that is set to "one" in order to
have PE0B act as a PEi. In particular, this signal line is labeled "Auxiliary Select" in Figure 13.

More particularly, control line "Select PE0" controls the operation of multiplexors 404
and 406 and AND-gate 405. In the PE0 mode of operation, the carry-in 1 signal line is supplied
to adder 175 together with the signal from the previous PE signal line coming in to the modified
rightmost processing element shown in Figure 13. If it is not in "PE0 mode," no carry input is
supplied to adder 175. Likewise, based upon the state of the "Select PE0" signal line, multiplexor
404 operates to select, as a carry input to the low order position of adder 175, either the usual
carry-out signal from adder 165 or, in the event of non-PE0 mode operation, the signal supplied
to the carry input of adder 185 is the carry-in 2 signal. Apart from these variations, the rest of the
circuits shown in Figure 13 operate in substantially the same manner as their counterparts in
Figure 11.

Figure 13 also introduces several other signal lines for proper operation in various
hardware modes. As described above, the "Auxiliary Select" signal line is a 2 bit signal taking on
the values "00," "01," or "10." The "Auxiliary Select" line has the value "10" for PE0B above to
concatenate PE0B with PE7A on its right in the case of non-CRT operation. This is the only time
that the "Auxiliary Select" signal bus is set to this value. In the other cases, this signal line is set
to "01" during the Z-phase (Select X/Z = 1). The "00" value of "Auxiliary Select" selects the Ai
input used for the X-phase while the "01" value for this signal line selects the y input for the
Z-phase of operation.

With respect to the other signal lines present in Figure 13, the "Select R or X" signal line
is equivalent to "Select X/Z"; and the "Select R2" signal line is driven independently when the
Processing Elements are used to perform standard multiplication operations as opposed to
modular multiplication. The "Select B or N" signal line assumes the value given by "Select X/Z"
during the next clock cycle (that is, the anticipated version of "Select X/Z"). The reason for this
is that the output of multiplexor 121 is used to select what is stored in BN register 198, which
contains B during an X-phase and N during a Z-phase.

Figure 14 illustrates modifications made to the circuit shown in Figure 10 to
accommodate repositioning adder 235 in Figure 10 to a position in the signal flow path which
reduces time criticality with respect to addition operations. With respect to the specific
differences between Figures 10 and 14, it is noted that, in Figure 14, it is no longer necessary to
supply the low order k bit output from adder 285 to the processing element to the right.
Additionally, it is noted that instead of the signal line being labeled Ai/y, the input signal line is
labeled M to reflect the fact that multiplexor 131 in Figure 13 now has three possible inputs to
select from rather than just Ai or y. The third input of multiplexor 131 (that is, the "Previous M"
signal line) is used to concatenate PE0B to PE7A (as per the example given above) during
non-CRT operations. This allows on-the-fly construction of a long chain of Processing Elements
(sixteen in the example) versus two independent chains of half as many (that is, eight in the
example) Processing Elements.
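
The three-way selection performed by multiplexor 131 can be summarized by a small behavioral sketch. This is only an illustrative model: the function name `select_m` and its argument names are hypothetical, and the 2 bit encodings assume the "Auxiliary Select" values are read as "00" (Ai, X-phase), "01" (y, Z-phase) and "10" (Previous M, chain concatenation).

```python
def select_m(auxiliary_select: int, a_i: int, y: int, previous_m: int) -> int:
    """Behavioral model of the M output of multiplexor 131 (hypothetical names).

    auxiliary_select is the assumed 2 bit "Auxiliary Select" value.
    """
    if auxiliary_select == 0b00:
        return a_i          # X-phase: k bit chunk Ai of the multiplier factor
    if auxiliary_select == 0b01:
        return y            # Z-phase: the constant y
    if auxiliary_select == 0b10:
        return previous_m   # non-CRT operation: M forwarded from the chain on the right
    raise ValueError("Auxiliary Select value 0b11 is not used in the description")
```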

Additionally, adder 435', which is 2k bits wide, is now interposed between its
corresponding register set segment and adder 285. In particular, the output of adder 435' is
supplied as the second input to adder 285 and the carry out of adder 435' is supplied to latch C3
(reference numeral 436), which supplies the carry-out 3 signal line. The contents of register
R2p+2 (reference numeral 292'), which is k bits in width, are supplied as the lower k-bit portion of
the left adder input under control of AND-gate array 401, which is in turn controlled by the signal
line "Select R2p+2." The contents of register R2p+2 are also supplied as the upper k-bit portion of
the left adder input under control of AND-gate array 403, which is in turn controlled by the "X/Z
Select" signal line. The right input to adder 435' is also 2k bits in width and is supplied from
AND-gate array 402 and from multiplexor 493. Under control of the "X/Z Select" signal line,
multiplexor 493 provides either the contents of register R2p+1 (reference numeral 291') or the
contents of register R2p from the processing element on the right. The 2k-bit data portion
supplied to the left input of adder 435' is controlled by AND-gate 401 and by AND-gate 403.
The right 2k-bit input to adder 435' includes two portions, one of which is a high order k bit wide
portion which is either zero or the k-bit data portion coming from register R2p+2 (reference
numeral 292') under control of AND-gate array 401, which is also under control of the "Select R2"
signal line. The lower order k bit wide portion of the right input to adder 435' is selected by
multiplexor 493 to be either the contents of register 291' (that is, R2p+1) or the contents of the
292' register (that is, R2p) in the processing element to the right. The operation of the circuits
described produces the result that adder 285 (Figure 14) accumulates the results of the
multiplication operations performed by multipliers 272 and 273 together with the output of adder
275. The left input of adder 285 is dependent on the phase of the operation for the Processor
Element containing adder 285. For example, during the X-phase, the result is (00 ... 0, R2i+1)
while during the Z-phase, the result is the binary sum (R2i+1, R2i) + (R2i+2, 00 ... 0), where
"00 ... 0" is k bits wide. The term including R2i+2 is added only during the Z-phase since, during
the X-phase, this register value is still being updated by the Processing Element to the left. This
aspect is best seen in Figure 12.
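
The phase dependence of the value delivered to adder 285 can be illustrated with a short behavioral sketch. It is not a model of the gate-level circuit: the function name and argument names are hypothetical, registers are treated as plain k bit integers, and the carry out of the 2k bit adder simply appears as bit 2k of the returned value rather than being latched separately.

```python
def phase_dependent_input(phase: str, r_2i: int, r_2i1: int, r_2i2: int, k: int) -> int:
    """Value supplied toward adder 285 by the repositioned 2k bit adder.

    r_2i, r_2i1 and r_2i2 stand for the k bit contents of R2i, R2i+1 and R2i+2.
    """
    mask = (1 << k) - 1
    if not all(0 <= r <= mask for r in (r_2i, r_2i1, r_2i2)):
        raise ValueError("register values must fit in k bits")
    if phase == "X":
        # (00...0, R2i+1): R2i+2 is still being updated by the PE to the left.
        return r_2i1
    if phase == "Z":
        # (R2i+1, R2i) + (R2i+2, 00...0)
        return ((r_2i1 << k) | r_2i) + (r_2i2 << k)
    raise ValueError("phase must be 'X' or 'Z'")
```

With k = 4, R2i = 3, R2i+1 = 5 and R2i+2 = 7, the X-phase value is 5 and the Z-phase value is 0x53 + 0x70 = 0xC3.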

Additionally, it is noted that if one desires to employ a simplified leftmost processing
element such as one that is similar to that shown in Figure 6, modifications are made to this
circuit to accommodate the improved pipelining version associated with Figures 13 and 14. In
particular, this is accomplished by the inclusion of an increment-carry circuit 439 between
previously employed AND-gate array 399 and k bit wide adder 385. The other signal supplied
to increment-carry circuit 439 is a carry input Cin which comes from latch 436 in the processing
element to the immediate right of the circuit shown in Figure 15. In particular, this signal line is
designated as carry-out 3 in Figure 14. As above, the use of a simplified leftmost processing
element (PEd) is optional but is clearly desired for purposes of circuit simplification, speed, and
cost. The Processing Element PEend or PEd includes the function of adding the previous PPO
(Partial Product Out) from the PE to its right to the potential carry out signal from adder 435',
which signal is temporarily stored in latch C3 (436). This result is stored in register R2p. During
the Z-phase, the result of this operation is accumulated in register R2p, as shown.

It is noted that it is also possible to utilize the pipelined version of the present invention
to process operands that are in fact wider than the hardware present in the processing
element chain (n > Wd or, equivalently, n > mk). The method for carrying out this extra
wide operation processing is illustrated in Figure 16. In particular, each horizontal line in Figure
16 represents a single clock cycle and each vertical column represents a slice of the data that is to
be processed. Assuming that each processing element processes 64 bits of data (2k bits
typically), the first column indicates that the lower 2k bits of the data are always processed by
processing element PE0. During the first clock cycle, only processing element PE0 is active. All
of the other processing elements are activated sequentially, clock cycle after clock cycle. This
provides sufficient time to the previous processor element to generate the pipelined data for the
next processing element. In fact, it is possible that the width of the operand is larger than the
processing element chain itself. For example, in the discussions herein, the situation in which
n = 512 bits has been considered. However, in accordance with this aspect of the present
invention, it is possible to process operands that are longer than 512 bits using a pipelined
hardware structure which is designed for 512 bits. In such circumstances, in the clock cycle after
the first processing element is activated, the entire processing element chain is shifted left by 2k
bits (see Figure 16), leaving the lower 2k bits unprocessed. This shifting continues until the upper
processing element (in this case, PE8) is capable of processing the upper 2k bits of the operand.
Following this, the processing element chain, instead of shifting back to the home position, stays
in place with the exception of the rightmost processing element PE0. The lower processing
element, after the others go into a home position, continues processing the lower 2k-bit slice
of the operand. When all of the processor elements are back in their home positions, the entire
chain starts a shift left as before. This mechanism allows all of the processing elements to be
busy all of the time and, accordingly, achieves a maximum performance level. Additionally, a
new operation can start before the previous operation is finished. The approach described herein
provides maximal performance in the sense that all of the Processing Elements are always busy.
Additionally, the next operation can be started immediately without any delay and without idling
any of the Processor Elements. Furthermore, these operations are fully compatible with the
pipelined approach as described above.

As indicated very early above in the description of the present algorithm for computing
AB mod N, it is desirable to begin the calculation with a value s which is equal to the negative
inverse of the value N0, where the inverse is now taken modulo R where R = 2^k. That is to say, in
the initial presentation of the algorithm employed herein, the availability of the value s = -1/N0
mod R was assumed. A circuit for carrying out this calculation is illustrated in Figure 17, which
shows, in its upper portion, a circuit for calculating successive values of the variable Q and
correspondingly illustrates a circuit in its lower portion for calculating a companion variable S
which ultimately becomes the desired s = -1/N0 mod 2^k. In this regard, it is noted that the circuit
shown in Figure 17 actually performs two operations. Firstly, it computes a multiplicative
inverse modulo a number which is a power of 2, and also at the same time computes the additive
inverse of the multiplicative inverse. In ordinary, non-modular arithmetic, the computation of an
additive inverse is a relatively simple operation requiring either the addition or change of a single
bit at the leftmost portion of a representative number or at most the addition of a 1 to the low
order position, depending upon the format in which the numbers are stored. However, in the case
of modular addition, it is noted that the operation cannot be carried out as simply as it is for
ordinary, non-modular arithmetic. Accordingly, it is noted that the circuit shown in Figure 17
actually carries out simultaneously two nontrivial operations modulo R. In particular, it
computes a multiplicative inverse while at the same time ensuring that the final result is the
negative additive inverse modulo R = 2^k.

In the context of the present invention, the algorithm set forth above for computing AB
mod N employs the variable s = -1/N0 modulo R. However, the circuit shown in Figure 17 is
capable of generating the negative multiplicative inverse of any k-bit number A initially stored in
the N0 register (reference numeral 501). The method employed for carrying out the formation of
the desired negative multiplicative inverse is set forth below. The inputs to the process are the
value k and the number whose negative multiplicative inverse is desired, namely, A, which is
expressible as an ordered k-tuple of the form (a_(k-1), ..., a_1, a_0). The desired output of this
process is a variable s = -1/A modulo 2^k. In the process described below, the variable s is
initially set equal to the value 2^k - 1. The variable A is also initially loaded into the Q register
(reference numeral 504) at the start of the process. Accordingly, if the "Start" signal line is "1,"
then multiplexor 505 selects as its output the contents of register 501, which contain the value N0
or, more generally, a variable A whose negative multiplicative inverse is to be generated.
Multiplexor 505 also receives as an input the output of k bit adder 503. This adder has two inputs,
namely, the leftmost k - 1 bits from Q register 504 and, as a k bit input, the value of A as stored in
register 501. Adder 503 also effectively performs a shift right operation under circumstances to be
described more particularly below, and accordingly, a zero high-order bit is added as appropriate
to effect this shift operation with zeros being shifted into the high-order position.

The process for carrying out the desired calculation resulting in the variable S being
transformed to -1/A mod 2^k is set forth below:

Set S = 2^k - 1
Set Q = A
For i = 1 to (k-1) do:
    Right shift Q one bit
    If rightmost bit of Q, namely Q1 = 1, then
        S = S - 2^i;
        Q = Q + A
    end if;
End for.

Accordingly, it is seen that the process in this embodiment of the present invention occurs
in k - 1 steps. At the last step, the contents of the S register are equal to the desired negative
multiplicative inverse of A (or N0 for the specific purposes of the present invention). It is also
seen that the process for calculating the negative multiplicative inverse employs the concomitant
calculation and updating of two variables, S and Q. The upper portion of Figure 17 illustrates the
updating and calculation of the variable Q. In particular, it is noted that if the rightmost bit of Q
(that is, Q1) is 1 then, via the utilization of AND-gate array 502, the contents of register 501 are
added to the current value of Q from Q register 504 with the output being stored back in the Q
register via multiplexor 505. It is noted that, at this stage of operation, the "Start" signal line is
not equal to "1" and, accordingly, multiplexor 505 selects as its input the output of adder 503.
Otherwise, the initialization Q = A is carried out.
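
The shift-and-add process described above can be checked in software with a minimal sketch such as the following. The function name `negative_inverse` is hypothetical, and the sketch assumes that Q is increased by A whenever its rightmost bit is 1 after the shift, as the description of adder 503 and AND-gate array 502 indicates. It is a behavioral model of the S and Q updates only, not of the circuit of Figure 17.

```python
def negative_inverse(a: int, k: int) -> int:
    """Return s = -1/a mod 2**k for odd a, using only shifts, adds and bit clears."""
    if a % 2 == 0:
        raise ValueError("a must be odd to be invertible modulo a power of two")
    s = (1 << k) - 1          # S register starts as all ones (2^k - 1)
    q = a                     # Q register starts as A
    for i in range(1, k):     # k - 1 steps
        q >>= 1               # right shift Q one bit
        if q & 1:             # rightmost bit of Q is 1
            s -= 1 << i       # S = S - 2^i, that is, clear bit i of S
            q += a            # Q = Q + A
    return s
```

For example, with a = 3 and k = 4 the loop leaves S = 5, and 3 * 5 = 15 = -1 mod 16.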



The circuit in the lower portion of Figure 17 calculates the companion variable S, which
is also the desired output at the end of the process. It is noted that in the updating of the variable
S, in accordance with the process indicated above, one performs a subtraction from the current
value of S of an amount which is equal to a power of 2 (S = S - 2^i). To effect the desired
process, S register 560 is initially loaded with a value which is "all ones," representing the integer
2^k - 1. AND-gate array 561 controls the writing of particular bits into the S register. In
particular, as seen in Figure 17, a k bit wide vector from AND-gate array 561 is available for
writing into register 560. AND-gate array 561 permits, during each clock cycle if necessary, the
writing of a k bit vector into S register 560. The selection of which vector is controlled by the
current value in counter 563, which counts upwards from 0 to k - 1, and then immediately back to
zero again in a rollover fashion. In the examples of the present invention described above, k is
typically equal to 32 bits. As such, counter 563 need contain only 5 bits. In general, counter 563
contains k' = log2 k bits. Thus, decoder ring 562 receives k' = 5 bits and produces as an output a k
bit vector, only one of whose entries is 1. This is the essential operational feature of a decoder
circuit. Counter 563 also supplies a signal line "ZeroCount" which is a "1" when the counter is
all zeros. This signal line is also supplied to AND-gate array 561, which triggers a write-enable
bit when Q(1) is "1" and the ZeroCount signal line is false and the Start signal line is false.
Accordingly, under these circumstances, AND-gate array 561, in accordance with the algorithm
described above, then permits the writing of a 0 bit into the corresponding portion of S register
560 as determined by the current value in counter 563 which, in effect, contains the variable i
recited in the algorithm listed above for negative multiplicative inverse calculation. It is in this
fashion that the value of S is updated to S = S - 2^i. Finally, at the end of the calculation, the
value in the S register, which is initially set equal to all ones, is now equal to the negative
multiplicative inverse modulo R of the value that was stored in the N0 register 501.

If instead of (-1/A) mod N, one wishes to calculate (1/A) mod N, one can employ the
following algorithm:

Set S = 1
Set Q = A
For i = 1 to (k-1) do:
    Right shift Q one bit
    If rightmost bit of Q, namely Q1 = 1, then
        S = S + 2^i (that is, set bit i to 1);
        Q = Q + A
    end if;
End for.
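
A corresponding sketch for the positive inverse is given below. It assumes that S starts at 1 and that the inverse is taken modulo 2^k; the function name `positive_inverse` is hypothetical. The result can be compared against Python's built-in three-argument `pow`, which accepts an exponent of -1 in Python 3.8 and later.

```python
def positive_inverse(a: int, k: int) -> int:
    """Return 1/a mod 2**k for odd a by setting bits of S instead of clearing them."""
    if a % 2 == 0:
        raise ValueError("a must be odd to be invertible modulo a power of two")
    s = 1                     # bit 0 of the inverse of an odd number is always 1
    q = a
    for i in range(1, k):
        q >>= 1               # right shift Q one bit
        if q & 1:             # rightmost bit of Q is 1
            s |= 1 << i       # S = S + 2^i, that is, set bit i of S
            q += a            # Q = Q + A
    return s


if __name__ == "__main__":
    # Cross-check against the built-in modular inverse (Python 3.8+).
    print(positive_inverse(3, 4), pow(3, -1, 16))
```

The two procedures are related by S(negative) = 2^k - S(positive), since every bit i >= 1 that is set in one result is cleared in the other.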



Accordingly, there is provided a circuit and a process for producing in a single set of
operations not only the multiplicative inverse modulo R of a given number, but also its
arithmetic negative value modulo the same value R. For purposes of the multiplication algorithm
for AB mod N described above, it is noted that it is the circuits shown in Figure 17 which are
preferably employed for the calculation of the variable s = -1/N0 mod R which is stored in
registers 60 in Figure 1, 160 in Figures 4 and 4A, 160 in Figure 11, and 160 in Figure 13.

As discussed above, a primary purpose of the present invention is the multiplication of
large integers modulo N for cryptographic purposes. Since cryptography often involves the
exponentiation operation, the use of the present hardware to perform exponentiation is now
described.

The relevant circuits and materials described above can be considered as implementing a
specific function, f, with the following properties:

    f(A, B) = A B 2^{-mk} mod N;

    f(A 2^{mk}, B 2^{mk}) = A B 2^{mk} mod N;

    f(A 2^{mk}, 1) = A mod N;

    if A < 2N and B < 2N, then f(A, B) < 2N; and

    if A < 2N and A ≠ N, then f(A, 1) < N.

In the above, the problem has been partitioned into m "words" of k bits each where mk ≥ n + 2
and where n is the number of bits in the binary representation of N. And as above, N_0 is the least
significant k bits of N. And N is, of course, odd.

In the discussion above, it was pointed out that multiplication modulo N would normally
be carried out in a two step process:

    Step 1: Result_1 = f(A, B) = A B 2^{-mk} mod N
    Step 2: Result_2 = f(Result_1, 2^{2mk}) = A B mod N.

From the above properties of f, it is seen that premultiplication of either A or B by 2^{mk} produces
the same result in one step:

    Result = f(A 2^{mk}, B) = f(A, B 2^{mk}) = A B mod N.
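
The properties of f listed above can be checked numerically with a small software model. The sketch below is not the hardware: it assumes a fully reduced result (the circuits only guarantee a result below 2N), reduces the premultiplied operand modulo N for convenience, and uses arbitrary illustrative values for N, mk, A and B.

```python
def f(a: int, b: int, n: int, mk: int) -> int:
    """Reference model of f(A, B) = A * B * 2^(-mk) mod N for odd N (Python 3.8+)."""
    return (a * b * pow(2, -mk, n)) % n


# Illustrative values only: N is odd and mk exceeds the bit length of N by at least 2.
N, MK = 3233, 16
A, B = 1234, 567
SHIFT = pow(2, MK, N)                  # 2^(mk) mod N

# Two step multiplication: f(A, B), then correction by 2^(2mk).
two_step = f(f(A, B, N, MK), pow(2, 2 * MK, N), N, MK)

# One step multiplication: premultiply one operand by 2^(mk).
one_step = f((A * SHIFT) % N, B, N, MK)

# Removing the factor again: f(A * 2^(mk), 1) = A mod N.
recovered = f((A * SHIFT) % N, 1, N, MK)
```

Both `two_step` and `one_step` equal A B mod N, and `recovered` equals A mod N.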

This is clearly the preferred approach for performing modular multiplication in a one shot situation
since premultiplication by 2^{mk} is easily performed via a shift operation. However, in the case of
exponentiation, one uses the modular multiplication function, as implemented in the hardware
described above, in a repeated fashion. In the present case then, exponentiation is carried out in a
repeated fashion, but now one must deal with the fact that there is a factor of 2^{-mk} present in the
output of each iteration of the function f; that is to say, f(A, B) = A B 2^{-mk} mod N. Accordingly,
in the present invention, the hardware implemented function f is used but with the factor 2^{mk}
being "preapplied" to both of the multiplicands, A and B, as follows: f(A 2^{mk}, B 2^{mk}) = A B 2^{mk}
mod N. This way, since the function f introduces a factor of 2^{-mk} at each step, repeated iterations
using preapplication of the 2^{mk} factor to both operands keep a constant factor of 2^{mk} as part of
the result. As a last step this factor is removed using the function f as implemented by the
present hardware in the following manner: f(A 2^{mk}, 1) = A mod N. Therefore, at the last iteration
in an exponentiation operation, A is the output from previous repeated applications of the
function f.

In order to see that this value of A going into the f function hardware at this stage is
constructed as an appropriate exponential, consider the general case of constructing the value
A^E mod N where E is an integer and in particular is an integer represented by the t + 1 bit binary
value e_t 2^t + e_{t-1} 2^{t-1} + ... + e_2 2^2 + e_1 2 + e_0 = \sum_{i=0}^{t} e_i 2^i, where e_i is either "1" or "0." Here,
advantage is taken of the fact that a sum in an exponent becomes a product (a^{b+c} = a^b a^c) so that:

    A^E = A^{\sum_{i=0}^{t} e_i 2^i} = \prod_{i=0}^{t} A^{e_i 2^i}.

Based upon this expression for A^E in terms of the binary integer E, it is seen that the following
algorithm provides a method for using the hardware for the function f herein to produce the result
A^E mod N, a result which is very important for cryptographic operations and particularly
important for public key cryptographic systems. Here, N, k, m, N_0 and s (= -1/N_0 mod R,
where R = 2^k) are as given above. The inputs to the method are the values A and E with E being
a t + 1 bit binary integer. The method is summarized in the following outline:

    Set C = 2^{2mk} mod N
    Z_0 = f(A, C)
    Z = Z_0
    For i = 1 to t
        Z = f(Z, Z)
        If e_{t-i} = 1, then Z = f(Z, Z_0), else continue
    End For
    Z = f(1, Z)

Thus, at the end of this method the value stored in the Z register is A^E mod N, as desired. This
procedure is also summarized in the flow chart shown as Figure 18.
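
The outline of Figure 18 can be modeled in a few lines of software. The sketch below is illustrative only: `f` is a fully reduced software stand-in for the hardware engine, the exponent is assumed to be at least 1 with its leading bit e_t equal to 1, and the function names are hypothetical.

```python
def f(a: int, b: int, n: int, mk: int) -> int:
    """Software stand-in for the engine: A * B * 2^(-mk) mod N (Python 3.8+)."""
    return (a * b * pow(2, -mk, n)) % n


def mod_exp_left_to_right(a: int, e: int, n: int, mk: int) -> int:
    """Compute a^e mod n by scanning the exponent bits from e_(t-1) down to e_0."""
    if e < 1:
        raise ValueError("this sketch assumes an exponent of at least 1")
    c = pow(2, 2 * mk, n)          # C = 2^(2mk) mod N
    z0 = f(a, c, n, mk)            # Z_0 = A * 2^(mk) mod N
    z = z0
    t = e.bit_length() - 1         # E is a (t + 1)-bit integer
    for i in range(1, t + 1):
        z = f(z, z, n, mk)         # squaring keeps the constant factor 2^(mk)
        if (e >> (t - i)) & 1:     # bit e_(t-i)
            z = f(z, z0, n, mk)
    return f(1, z, n, mk)          # final step removes the factor 2^(mk)
```

With the illustrative values N = 3233 and mk = 16, `mod_exp_left_to_right(65, 17, 3233, 16)` agrees with Python's built-in `pow(65, 17, 3233)`.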

A slightly different form of the exponentiation algorithm is implemented in Figure 19. It
is also described in the pseudo code provided below:

    Set C = 2^{2mk} mod N
    Z_0 = f(A, C)
    If e_0 = 0, then set Z = 1, else set Z = Z_0
    For i = 1 to t
        Z_0 = f(Z_0, Z_0)
        If e_i = 1, then Z = f(Z, Z_0), else continue
    End For
    Z = f(1, Z)

In constructing circuits for implementing either of these methods for modular exponentiation, it
should be noted that f is a symmetric function so that f(A, B) = f(B, A). If f is instead viewed as
an operator, this condition is referred to as commutativity. Thus, circuits implementing f can have
their inputs switched with no change in operation. One also notes in the algorithm set forth
immediately above that e_0 is the lowest order bit in the binary representation for the exponent E.
As such, for the cryptographic purposes described herein, one notes that the exponent E is an odd
number. Thus, its lowest order bit position is always 1. Thus, for cryptographic purposes the step
which tests to see if e_0 = 0 can be eliminated.
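
The Figure 19 form scans the exponent from its lowest order bit upward. The following sketch models only the cryptographic case noted above, in which the exponent is odd so that the test on e_0 is dropped; `f` is again a fully reduced software stand-in for the engine, and all names are hypothetical.

```python
def f(a: int, b: int, n: int, mk: int) -> int:
    """Software stand-in for the engine: A * B * 2^(-mk) mod N (Python 3.8+)."""
    return (a * b * pow(2, -mk, n)) % n


def mod_exp_right_to_left(a: int, e: int, n: int, mk: int) -> int:
    """Compute a^e mod n for odd e by scanning exponent bits e_1 .. e_t upward."""
    if e < 1 or e % 2 == 0:
        raise ValueError("this sketch covers only odd exponents (e_0 = 1)")
    c = pow(2, 2 * mk, n)          # C = 2^(2mk) mod N
    z0 = f(a, c, n, mk)            # Z_0 = A * 2^(mk) mod N
    z = z0                         # e_0 = 1, so Z starts at Z_0
    t = e.bit_length() - 1
    for i in range(1, t + 1):
        z0 = f(z0, z0, n, mk)      # Z_0 now holds A^(2^i) * 2^(mk) mod N
        if (e >> i) & 1:           # bit e_i
            z = f(z, z0, n, mk)
    return f(1, z, n, mk)          # remove the constant factor 2^(mk)
```

Here the running square is kept in Z_0 and the accumulated product in Z, so the two multiplications in each iteration are independent of one another.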

As an example, a circuit which can implement either one of the algorithms for
exponentiation is shown in Figure 20. The core of this exponentiation circuit is provided by an
engine which implements the f(A, B) = A B 2^{-mk} mod N function. Thus, engine 600 may be
implemented by means of any of the hardware components described above which perform this
function. The output from multiplication modulo N engine 600 is provided to decoder 603 which
operates under control of finite state machine (FSM) 607 to store this output either in Z register
604 or in Z_0 register 605, or in both (to provide the Z = Z_0 step in the algorithm of Figure 18), as
needed. Thus, decoder 603 does not always function in accordance with the standard
operational definition of a "decoder" which would normally have only one set of output lines
carrying information. If the circuit of Figure 20 is intended to implement either of the
exponentiation algorithms herein, then the outputs of registers 604 and 605 (Z and Z_0) are both
provided as inputs to multiplexors 601 (for input A) and 602 (for input B). These multiplexors
are also provided with constants 1 and C = 2^{2mk} mod N. It is noted, however, that the constant
"1" could also have been provided instead as an input to multiplexor 601. However, the constant
C and the input A (which is used for computing A^E mod N) need to be provided to different ones
of multiplexors 601 and 602 for the purpose of calculating the value Z_0 = f(A, C). Multiplexors
601 and 602 and decoder 603 all operate under control of controller 607 which is preferably
implemented as a Finite State Machine which can have as few as 6 states which depend only on
the contents of index counter 608 (which counts from 0 to t and then resets back to 0) and on the
selected bit e_i from register 606 which contains the exponent E in binary form.

For example, in implementing the algorithm illustrated in Figure 18, when counter 608 is
at 0, controller 607 selects the A input for multiplexor 601 and the C input for multiplexor 602.
It is also noted that, for both algorithms, the initialization and repetition aspects both involve two
steps. Accordingly, FSM 607 also includes one-bit register 609 (step state register) which is
indicative of this step state. Having used multiplexors 601 and 602 to select A and C as inputs to
engine 600, FSM 607 also controls decoder (or router, if you will) 603 to store the output f(A, C)
= A C 2^{-mk} mod N into Z_0 register 605. The design of FSMs for such purposes is standard and
well known and is, for example, described in the text "Digital Logic and Computer Design" by
M. Morris Mano, Copyright 1979 by Prentice-Hall.

In the use of the CRT as described above it is seen that one requires the constant C
defined as 2^{2mk} mod N. While the constant 2^{2mk} is generally easy to determine and construct, the
inclusion of the need for this to be modulo N is a complicating factor. Note here too that it is the
case that mk ≥ n + 2 where n is the number of bits in N and that m is picked to be the smallest
integer satisfying this relationship. Thus 2^{2mk} is always going to be greater than N and hence the
modulo N form is needed. However, this constant is readily calculable using the f engine
described above. One first calculates T = 2^{mk+t} for a small value of t. The f engine is then used
repeatedly as follows:

    f(T, T) = 2^{mk+t} 2^{mk+t} 2^{-mk} mod N
            = 2^{mk+2t} mod N,
    f(2^{mk+2t}, 2^{mk+2t}) = 2^{mk+4t} mod N,
    f(2^{mk+4t}, 2^{mk+4t}) = 2^{mk+8t} mod N, etc.

This process is repeated until the first time that the result is greater than N.

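The doubling of the exponent offset in the sequence above can be sketched as follows. This is a simplified model resting on assumptions that are not taken from the text: t is chosen so that mk = t * 2^j for some j, the loop simply stops after j squarings (when the exponent reaches 2mk), and the starting value is reduced with Python's `%` operator as a stand-in for whatever single shot reduction is available.

```python
def f(a: int, b: int, n: int, mk: int) -> int:
    """Software stand-in for the engine: A * B * 2^(-mk) mod N (Python 3.8+)."""
    return (a * b * pow(2, -mk, n)) % n


def constant_c(n: int, mk: int, t: int) -> int:
    """Return 2^(2mk) mod n by repeated squaring with f, assuming mk == t * 2**j."""
    squarings, offset = 0, t
    while offset < mk:             # each use of f doubles the offset t -> 2t -> 4t ...
        offset *= 2
        squarings += 1
    if offset != mk:
        raise ValueError("this sketch assumes mk is t times a power of two")
    value = pow(2, mk + t, n)      # T = 2^(mk + t), reduced mod N (assumption)
    for _ in range(squarings):
        value = f(value, value, n, mk)   # exponent mk + x becomes mk + 2x
    return value
```

For mk = 16 and t = 1 the exponent runs through 17, 18, 20, 24 and 32, so four applications of f are enough.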


In public key cryptographic systems someone who wants to receive information picks two
(large) prime numbers N_p and N_q and publishes only their product N = N_p N_q. The potential
receiver then generates (or otherwise creates, often randomly) a public key E which is also
published. Before publication, however, the receiver-to-be checks to make sure that E is
relatively prime with respect to the product (N_p - 1)(N_q - 1). This is easily done since the
receiver knows both N_p and N_q. With N and E thus known to the public, anyone wishing to
transmit a message A destined for the receiver can form the encrypted version c of the message
by computing c = A^E mod N. Thus, encryption is an exponentiation operation modulo N. It is
the "modulo N" aspect which makes this a nonstandard arithmetic problem. However, the
systems provided herein are particularly capable of performing the mod N operation.

At the receiving end the message is decrypted as A = c^D mod N, where, as above, c is the
received/encrypted message and where D is a private key known only to the receiver and which
is calculated as D = E^{-1} mod [(N_p - 1)(N_q - 1)]. This is something which can be computed by
the receiver since the receiver (and only the receiver) knows the values N_q and N_p. (Since N = N_p
N_q is a large number, typically with thousands of bits, even though N is known, its factors, the
prime numbers N_q and N_p, are very hard to determine. This fact lies at the heart of public key
cryptography.) The receiver also computes, actually precomputes, several other values that are
useful in efficient decryption. In particular, the receiver computes three values U, D_p and D_q as
follows:

    U = (1/N_q) mod N_p,
    D_p = D mod (N_p - 1),
    D_q = D mod (N_q - 1).

These values render it possible to more efficiently construct the desired result which is c^D mod
N. This process is more particularly illustrated in Figure 21. (Coded message c is not to be
confused with the constant C = 2^{2mk} used above.)

Advantage is now taken of the fact that the receiver, knowing N_p and N_q, is able to
calculate U, D_p and D_q so that advantage may be taken of the Chinese Remainder Theorem. The
coded message c is an integer between 0 and N = N_p N_q where gcd(N_p, N_q) = 1 and where "gcd"
stands for "greatest common divisor." If c_p = c mod N_p and c_q = c mod N_q then the CRT
implies that c may be computed as follows:

    c = c_q + N_q [ ( ((c_p - c_q) mod N_p) U ) mod N_p ],

where U is as defined above. This result is now more particularly applied to the computation of
c^D mod N. One first considers (c^D)_p which is defined as c^D mod N_p. Likewise, one also considers
(c^D)_q which is similarly defined as c^D mod N_q. Note that (c mod N_p)^D mod N_p = (c mod N_p)^{D_p} mod N_p
where D_p = D mod (N_p - 1). Similarly, (c mod N_q)^D mod N_q = (c mod N_q)^{D_q} mod N_q where, similarly, D_q
= D mod (N_q - 1). Thus, given c, D_p, D_q, N_p, N_q and U the exponential c^D mod N can be
calculated in three steps:

    Step 1. c_p = c mod N_p; c_q = c mod N_q.

    Step 2. (c_p)_D = (c_p)^{D_p} mod N_p; (c_q)_D = (c_q)^{D_q} mod N_q.

    Step 3. c^D mod N = (c_q)_D + N_q [ ( (((c_p)_D - (c_q)_D) mod N_p) U ) mod N_p ].

Step 2 above is readily carried out using the methods set forth in Figures 18 and 19. Step 3 is a
straightforward calculation not involving exponentiation. Furthermore, as indicated above it is
possible to split the sequence of Processing Elements into two chains which together calculate
(c_p)_D and (c_q)_D simultaneously.
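
The three steps can be sketched in software as below. This is an illustrative model only: the two half-size exponentiations use Python's built-in `pow` in place of the hardware engine, the recombination is written with N_q multiplying the reduced bracket (the usual CRT recombination), and the small primes in the usage note are toy values chosen for the example.

```python
def crt_power(c: int, d: int, n_p: int, n_q: int) -> int:
    """Compute c^d mod (n_p * n_q) from two half-size exponentiations."""
    # Precomputed by the receiver (Python 3.8+ for the modular inverse).
    u = pow(n_q, -1, n_p)          # U = (1/N_q) mod N_p
    d_p = d % (n_p - 1)            # D_p = D mod (N_p - 1)
    d_q = d % (n_q - 1)            # D_q = D mod (N_q - 1)

    # Step 1: reduce the input modulo each prime.
    c_p, c_q = c % n_p, c % n_q

    # Step 2: two independent exponentiations on half-size operands.
    r_p = pow(c_p, d_p, n_p)
    r_q = pow(c_q, d_q, n_q)

    # Step 3: recombine without any further exponentiation.
    h = (((r_p - r_q) % n_p) * u) % n_p
    return r_q + n_q * h
```

As a usage example with toy values N_p = 61, N_q = 53 and E = 17, one has D = `pow(17, -1, 60 * 52)`; encrypting the message 65 as `pow(65, 17, 3233)` and passing the result to `crt_power` with this D returns 65.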

Attention is now directed to a method for further simplifying the computation shown in
step 1 immediately above. Since the input to the process is a relatively large number, perhaps
being represented by as many as 2,048 bits, the calculation can be time consuming. However,
the modular reduction is based on numbers N_p and N_q which are often roughly only half that size.
Suppose then that, phrased more generally, one wishes to compute A_p = A mod N_p and likewise A_q
= A mod N_q. Without loss of generality one may assume that N_p > N_q. Suppose further that n_p
and n_q are the number of bits in the binary representations for N_p and N_q, respectively. Suppose
even further that one picks values m_p and m_q such that these are the smallest integers for which:

    m_p k ≥ n_p + 2, and

    m_q k ≥ n_q + 2,

where k is the word size in the circuits described above for modular multiplication. With these
parameters one may now write A in either of the two forms:

    A = A_{1p} 2^{m_p k} + A_{0p},
or
    A = A_{1q} 2^{m_q k} + A_{0q},

depending on whether one wishes to compute either A_p or A_q, both of which are employable in
the application of the CRT as described above. If A is of the order of 2,048 bits, then n_p + n_q ≈
2048; and in general: 0 ≤ A_{0p} < 2^{m_p k}; 0 ≤ A_{0q} < 2^{m_q k}; 0 ≤ A_{1p} < N_p; and 0 ≤ A_{1q} < N_p. One further
defines two constants C_p = 2^{2 m_p k} mod N_p and C_q = 2^{2 m_q k} mod N_q. These constants have
substantially the same role as the constant C = 2^{2mk} mod N discussed above, but now these new
constants are employed to facilitate computation on a smaller scale problem in accordance with
the representation of A as having two parts (A_{1p} and A_{0p} for the mod N_p calculation and A_{1q} and
A_{0q} for the mod N_q computation).

As indicated above the present inventors have provided circuits for construction of an
engine which implements the function f(A, B) = A B 2^{-mk} mod N. This engine/circuit is also fully
capable of implementing different functions in dependence on the m and N parameters.
Accordingly, the functions f_p and f_q are defined as follows:

    f_p(A, B) = A B 2^{-m_p k} mod N_p,
and
    f_q(A, B) = A B 2^{-m_q k} mod N_q.

Consider first the use of f_p in the calculation of A_p based on the use of the two part representation
of A as A_{1p} 2^{m_p k} + A_{0p}.

    a = f_p(A_{0p}, 1) = A_{0p} 2^{-m_p k} mod N_p
    b = f_p(A_{1p} 2^{m_p k}, 1) = A_{1p} mod N_p
    g = a + b = A_{1p} + A_{0p} 2^{-m_p k} mod N_p
    f_p(g, C_p) = g 2^{-m_p k} 2^{+2 m_p k} mod N_p
                = g 2^{m_p k} mod N_p
                = A_{1p} 2^{m_p k} + A_{0p} mod N_p
                = A mod N_p
                = A_p

In the same manner one uses the circuits herein to compute A_q using the parameters m_q and N_q to
produce f_q as defined above.
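
The reduction of a long operand by means of its two part representation can be traced with a short software model. The sketch below follows the sequence a, b, g, f_p(g, C_p) given above; `f` is a fully reduced stand-in for the engine, and the function name and the toy parameters in the usage note are assumptions made for the illustration.

```python
def f(a: int, b: int, n: int, mk: int) -> int:
    """Software stand-in for the engine: A * B * 2^(-mk) mod N (Python 3.8+)."""
    return (a * b * pow(2, -mk, n)) % n


def reduce_by_split(a: int, n_p: int, m_p: int, k: int) -> int:
    """Return a mod n_p using only the engine f_p and one addition."""
    mk = m_p * k
    a1 = a >> mk                       # high part A_1p
    a0 = a & ((1 << mk) - 1)           # low part A_0p, below 2^(m_p k)
    c_p = pow(2, 2 * mk, n_p)          # C_p = 2^(2 m_p k) mod N_p

    a_term = f(a0, 1, n_p, mk)         # a = A_0p * 2^(-m_p k) mod N_p
    b_term = f(a1 << mk, 1, n_p, mk)   # b = A_1p mod N_p
    g = a_term + b_term
    return f(g, c_p, n_p, mk)          # g * 2^(m_p k) mod N_p = A mod N_p
```

For example, with N_p = 61 (6 bits) and k = 4, the smallest m_p with m_p k ≥ 8 is 2, and `reduce_by_split(3000, 61, 2, 4)` equals `3000 % 61`.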

The overall structure for a preferred embodiment of cryptographic engine 700 employing
the circuit and operational principles set forth above is shown in Figure 22. The main feature of
cryptographic engine 700 is the inclusion of modulo N multiplier 600 as described above. It is
noted that, as implemented herein as a sequence of independent Processing Elements (PEs),
multiplier engine 600 is dividable into two pieces by the operation of electrically controlling a
Processing Element so as to cause it to operate as a "PE_0" element. This is particularly useful
during decryption operations since in this circumstance the receiver knows both N_p and N_q,
whereas during encryption the sender knows only the product N = N_p N_q.

For the calculation of A^B mod N, register set 658 contains registers for holding the
following values: A, B_p, B_q, N_p, N_q and U, where B_p = B mod (N_p - 1) and B_q = B mod (N_q - 1).
Register set 658 also preferably includes at least two utility registers for holding temporary
and/or intermediate results. In particular two such utility registers are preferably employed to
contain the values A_{1q} and A_{0q} as described above, with A_{0p} and A_{1p} being thus stored in the AH
and AL registers respectively. Clearly, the roles of these two utility registers are
interchangeable. Register set 658 also includes an output register which contains output results
from multiplier engine 600.

Cryptographic engine 700 also includes modular reduction unit 653 (also described herein
as Auxiliary Computation circuit in Figure 23) which performs addition and subtraction
operations and performs single shot modular reductions.

The flow of signals across databus 670 between register set 658 to and from multiplier
engine 600 and modular reduction unit 653 is carried out under control of Finite State Machine
(FSM) Command Control Unit 660 in accordance with the methods, algorithms, and protocols
set forth above for carrying out any or all of the following: modular multiplication, constant C
generation, exponentiation and the use of the Chinese Remainder Theorem (CRT) for calculating
modular numbers and for efficient exponentiation.

I/O control unit 665, besides implementing the decoding and control function necessary
to supply values such as A, B, N, B_p, B_q, N_p, N_q and U to the register set 658 through databus
670, provides two important functions in the case of modular exponentiation with CRT. The
first important function is that it dynamically calculates the value of m or m_p and m_q and it also
calculates the lengths of the exponents B or B_p and B_q. Each value of the m's is a function of the
length of a modulus (position of the leading 1) and is a key parameter used throughout the
operations. The length of an exponent is simply used to determine when to stop the
exponentiation process. The traditional solution is the use of a length detector that monitors the
value of each bit in these large registers. This approach has disadvantages in terms of requiring
more silicon area and also in terms of electrical loading on the output of the registers. The
approach used in the I/O control logic is much less wasteful and is based on the detection of the
leading '1' in the k bit word being written and the associated address. Every time a non-zero k
bit word is written, a small piece of logic is used to calculate the location of the most significant
'1' which is being written, based on the address of the word itself, and is compared with a value
stored in a register that is the result of the loading of the previous k bit word. If the new value
calculated is larger than the value stored in the register, the register is updated accordingly. The
calculation of the m parameter follows a similar approach and thus saves the need for a lookup
table and another large leading '1' detector. The second important function is that in preparation
for performing modular exponentiation with the CRT, the values of A_{0p}, A_{1q}, and A_{0q}, as
described previously, are calculated and loaded into separate registers under control of I/O
control unit 665.
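
The write-time length detection described above can be modeled as a running maximum that is updated on every word write. The sketch below is a software analogy only; the class name, the convention that address 0 holds the least significant word, and the derivation of m from the tracked length are assumptions made for the illustration.

```python
class LengthTracker:
    """Track the position of the leading '1' of a value as its k bit words are written."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.length = 0            # bit length seen so far (0 if all words were zero)

    def write(self, address: int, word: int) -> None:
        """Record one k bit word written at the given word address (0 = low word)."""
        if word != 0:
            # Position of the most significant '1' in this word, counted from bit 1.
            candidate = address * self.k + word.bit_length()
            if candidate > self.length:
                self.length = candidate

    def m_parameter(self) -> int:
        """Smallest m with m * k >= length + 2 (ceiling division)."""
        return -(-(self.length + 2) // self.k)
```

Writing the three 4-bit words of 3233 (hexadecimal CA1) in any order leaves `length` at 12, and `m_parameter()` then returns 4.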

Commands which externally govern the operation of engine 700 are also supplied via I/O
control unit 665. Attention is now directed to a checking system and method which takes the
fullest advantage of the modular multiplication circuits described above. In general, there are
several ways to provide checking for the results of the hardware operations carried out by the
system of the present invention. However, most of the standard approaches to checking are
negatively impacted by size, economies of chip real estate and/or by the fact that the arithmetic
operations carried out are modulo operations. For example, result checking based on a
straightforward duplication of hardware is very expensive in terms of "silicon real estate." Error
checking for the various function blocks employed (multipliers, adders, controls, etc.) is also
very expensive and complicated. Lastly, the use of residue arithmetic check sum methods is not
directly applicable to checksums for the modular multiplication hardware that implements the Z
= f(A, B) = A B 2^{-mk} mod N function described above. For example, if Z', A' and B' are the check
sums of Z, A, and B, respectively, then it is still unfortunately the case that Z' is not necessarily
equal to f(A', B'). Accordingly, driven by the inappropriateness of standard approaches to
hardware operation checking, there is provided herein a method and system which is closely tied
to the architecture described above and which is particularly tied to the fact that the systems
herein perform modulo multiplication using X and Z phases of operation and employ a
plurality of Processing Elements based on the notion of partitioning the operands involved into a
plurality, m, of k bit words.

For an easier understanding of the checking method and system herein, one starts with an
understanding of the process described above:

Process inputs: A, B, N (where N is, of course, odd)

    n = number of bits in the binary representation of N
    k = number of bits in a word (i.e., in each chunk processed by one of the Processing
        Elements)
    m = smallest integer for which mk ≥ n + 2
    N_0 = least significant k bits of N
    R = 2^k
    s = (-1/N_0) mod R
    A = \sum_{i=0}^{m-1} A_i R^i

Process output: Z = f(A, B) = A B 2^{-mk} mod N

Process:

    Set Z_0 = 0
    For i = 0 to m-1 do:
        X-phase:
            X_i = Z_i + A_i B
            y_i = s x_{i,0} mod R (where x_{i,0} = least significant k bits of X_i)
        Z-phase:
            Z_{i+1} = (X_i + y_i N) / R
    End for.
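
The word-serial process can be traced in software as follows. This is a sketch of the arithmetic only, not of the Processing Element pipeline; it assumes the Z-phase update Z_{i+1} = (X_i + y_i N) / R, takes m as a caller-supplied parameter, and uses hypothetical function and variable names.

```python
def word_serial_multiply(a: int, b: int, n: int, k: int, m: int):
    """Return (Z, [y_0 .. y_(m-1)]) with Z congruent to a * b * 2^(-mk) mod n."""
    r = 1 << k
    mask = r - 1
    s = (-pow(n & mask, -1, r)) % r      # s = (-1/N_0) mod R (Python 3.8+)
    z = 0
    ys = []
    for i in range(m):
        a_i = (a >> (i * k)) & mask      # i-th k bit word of A
        x = z + a_i * b                  # X-phase: X_i = Z_i + A_i * B
        y = (s * (x & mask)) & mask      # y_i = s * x_(i,0) mod R
        z = (x + y * n) >> k             # Z-phase: low k bits are zero, so shift
        ys.append(y)
    return z, ys
```

With the illustrative values N = 3233, k = 4 and m = 4 (so that mk = 16 ≥ n + 2 = 14), the returned Z is below 2N for any A and B below 2N, and Z mod N equals A B 2^(-16) mod N.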



Based on the above algorithm, structure, and process, the following equations lie at the
heart of the model employed herein for checking the operation of the modulo N multiplication
circuits:

    A = \sum_{i=0}^{m-1} A_i R^i

    B = \sum_{i=0}^{m-1} B_i R^i

    N = \sum_{i=0}^{m-1} N_i R^i

    Z = \sum_{i=0}^{m-1} Z_i R^i

    f(A, B) = (A B)/R^m + N \sum_{i=0}^{m-1} y_i / R^{m-i}

    Z mod (R - 1) = \sum_{i=0}^{m-1} Z_i mod (R - 1)

        = A B + N \sum_{i=0}^{m-1} y_i mod (R - 1)

        = [ (\sum_{i=0}^{m-1} A_i mod (R - 1)) (\sum_{i=0}^{m-1} B_i mod (R - 1)) +
            (\sum_{i=0}^{m-1} N_i mod (R - 1)) (\sum_{i=0}^{m-1} y_i mod (R - 1)) ] mod (R - 1)

The hardware which calculates the function f(A, B) is therefore checkable through the use of the
following relationship (referred to below as Equation (1)):

    \sum_{i=0}^{m-1} Z_i mod (R - 1) = [ (\sum_{i=0}^{m-1} A_i mod (R - 1)) (\sum_{i=0}^{m-1} B_i mod (R - 1)) +
        (\sum_{i=0}^{m-1} N_i mod (R - 1)) (\sum_{i=0}^{m-1} y_i mod (R - 1)) ] mod (R - 1)        (1)
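
Equation (1) can be exercised with a small software model. The sketch below recomputes the word-serial product, forms the word sums modulo (R - 1), and compares the two sides; it relies on the fact that R is congruent to 1 modulo (R - 1), so a number and the sum of its k bit words have the same residue. All function names and the toy parameters are assumptions made for the illustration.

```python
def word_sum(x: int, k: int) -> int:
    """Sum of the k bit words of x, reduced modulo (R - 1) with R = 2^k."""
    r_minus_1 = (1 << k) - 1
    total = 0
    while x:
        total += x & r_minus_1     # R - 1 doubles as the k bit word mask
        x >>= k
    return total % r_minus_1


def multiply_with_trace(a: int, b: int, n: int, k: int, m: int):
    """Word-serial product Z and the list of y_i values produced along the way."""
    r, mask = 1 << k, (1 << k) - 1
    s = (-pow(n & mask, -1, r)) % r
    z, ys = 0, []
    for i in range(m):
        x = z + ((a >> (i * k)) & mask) * b
        y = (s * (x & mask)) & mask
        z = (x + y * n) >> k
        ys.append(y)
    return z, ys


def predicted_checksum(a: int, b: int, n: int, ys, k: int) -> int:
    """Right hand side of Equation (1)."""
    r_minus_1 = (1 << k) - 1
    y_sum = sum(ys) % r_minus_1
    return (word_sum(a, k) * word_sum(b, k) + word_sum(n, k) * y_sum) % r_minus_1


def equation_1_holds(a: int, b: int, n: int, k: int, m: int) -> bool:
    z, ys = multiply_with_trace(a, b, n, k, m)
    return word_sum(z, k) == predicted_checksum(a, b, n, ys, k)
```

A single flipped bit in Z changes its word sum by a power of two, which is never a multiple of R - 1 when k is at least 2, so such an error makes the two sides disagree.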

The fortunate part of this checksum calculation is that it is computed on the fly. For example, the
circuitry necessary for the calculation of \sum_{i=0}^{m-1} y_i mod (R - 1) is shown in Figure 24. It is noted,
however, that the circuit(s) shown in Figure 24 are provided for the specific case of the use of the
Chinese Remainder Theorem where N_p and N_q are known and the Processing Elements are split
into two independent chains, one for calculating multiplication modulo N_p and the other for
calculating multiplication modulo N_q. In the case of modulo N_p calculations, accumulating
register Y (reference numeral 652.3a; not to be confused with the y_i variable used above to
describe the algorithm) is initially set to zero with its output being used as an input to adder
652.2a along with the input y_{i,p} from the corresponding portion of register for the Processing
Element partition which generates the y_i values. The input from register 652.1a is added to the
current Y_p value to produce a running accumulation which is stored between cycles in register Y
(reference numeral 652.3a). At the end of m cycles the contents of this register is the value Y'_p =
\sum_{i=0}^{m-1} y_{i,p} mod (R - 1). Likewise, the corresponding circuit shown in the lower portion of Figure 24
operates in an identical fashion to compute Y'_q = \sum_{i=0}^{m-1} y_{i,q} mod (R - 1). In the case of both the Y'_p
and the Y'_q computations, adders 652.2a and 652.2b respectively are each k bit integer binary
adders with carries out of the high order position being fed back as carry inputs to the low order
positions. In this way addition modulo (R - 1) is carried out.
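
The effect of feeding the carry out back in as a carry in can be seen in a short software model of a k bit adder. This is a sketch under the assumption that both inputs fit in k bits; the function names are hypothetical. Note that the all-ones pattern (R - 1) can appear as a result and represents the same residue as zero.

```python
def end_around_carry_add(x: int, y: int, k: int) -> int:
    """Add two k bit values, feeding the high order carry back into the low order end."""
    mask = (1 << k) - 1
    total = x + y                          # at most k + 1 bits for k bit inputs
    return (total & mask) + (total >> k)   # fold the carry out back in


def accumulate(words, k: int) -> int:
    """Running checksum of a sequence of k bit words, as an accumulating register would hold."""
    acc = 0
    for w in words:
        acc = end_around_carry_add(acc, w, k)
    return acc
```

For k = 4, adding 12 and 9 gives 21, whose low four bits are 5 with a carry out of 1; folding the carry yields 6, which is 21 mod 15.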

Thus, the circuits shown in Figure 24 supply check sum values Y'_p and Y'_q to check sum
predictor circuit 800 of Figure 25. It is noted that circuits (not shown) very similar to those of
Figure 24 are likewise provided for the generation of checksum values A'_p and A'_q from
accumulated sums (modulo (R - 1)) of the values A_{i,p} and A_{i,q} respectively for i = 0, 1, ..., m - 1.
Similarly, checksum values B'_p and B'_q are generated from similar circuits (also not shown).
Similar circuits also generate the values N'_p and N'_q from the N_{i,p} and N_{i,q} values. Since these
circuits are identical in structure and operation and differ only in the naming of the signal
components, like the circuits mentioned just above they are also not shown herein.

The addition operation indicated in Equation (1) is carried out by adder 820 which
performs addition modulo (R - 1) and accordingly, like the other adders in the checksum system,
includes a high order carry out signal output which is fed back as a low order carry input, as
shown. Multiplexors 824, 825, 826, and 827 are operated under control of two signal lines. A
first signal control line (p/q) controls multiplexors 824 and 826 to select between the two
independent Processor Element chains for N_p and N_q processing. A second signal control line
(Select Add) controls multiplexors 825 and 827 to effect the cumulative addition operation
indicated by the summation from i = 0 to (m - 1) in Equation (1). In order to calculate the
intermediate checksum values A'_p B'_p and A'_q B'_q a final addition operation is performed
which adds together the contents of the P_0 and P_1 registers (reference numerals 821 and 822,
respectively) via operation of the Select Add control line. Adder 820 is also responsible for the
final addition which generates (AB)'_p and (AB)'_q by adding together the previous checksum
values, stored in registers 831 and 832, with the cumulative checksums (NY)'_p and (NY)'_q. This
results in the generation of the P Checksum and Q Checksum values from registers 831 and 832
respectively. These signal lines are supplied to main checksum generation block 670 (in Figure
23). In particular, the P Checksum and Q Checksum signal lines are supplied to comparators
657a and 657b, respectively, as shown in Figure 26.

Accordingly, attention is now focused on the structure and operation of Figure 26. The
main function of block 670 is the calculation of the left hand side of equation (1). As above, this
circuit has two parts devoted to split calculations based on N_p and N_q operations as when the
Processor Elements in Figure 7 are split by controlling a middle Processing Element so as to
force it into operating in the PE_0 mode.

Each Processing Element chain (the N_p chain or the N_q chain) outputs results of the
modular multiplication operation 2k bits at a time. Accordingly, the circuit for generating the
checksum value Z' for the Z variable is implemented as two adders with k bits each.
Additionally, because of the splitting, there are actually a total of four adders shown in Figure 26.
For the N_p chain, for example, adder 656a processes the high order bits output from the
multiplication operation that produces each high order k bit output from the chain working on the
modulo N_p multiplication. After all of the 2k bit portions have been added together, multiplexor
656a2 is operated to add together the sums in the high order register Z_{p,H} and the low order
register Z_{p,L}. This resulting sum is compared with the P Checksum value by comparator 657a to
produce an error indication Error2a, if there is no match. It is also noted that the adders in Figure
26 all perform addition modulo (R - 1) and include a carry feedback out of the high order
position into the low order position. The bottom circuit shown in Figure 26 is structured and
operates in the same way as the upper circuits. However, as is clearly evident the bottom circuit
is associated with and operates on signals generated during calculations modulo N_q based on the
splitting of the Processor Element chain as described. Accordingly, the lower circuit in Figure 26
generates the Z'_q checksum signal from the modulo N_q calculations, which resultant value is
compared in comparator 657b to generate error signal Error2b, if there is no match. Thus, the
output of block 670 is describable as: Error2a OR Error2b. Thus, at the end of each modular
multiplication operation, an error signal is available which functions to provide an indication that
all hardware elements have worked as designed to produce the intended result.

Additionally, Figure 23 also shows the inclusion of Auxiliary Computation circuit 653.
This circuit is used to perform auxiliary operations such as Z = J + K, Z = J - K and Z = J mod
N. Checksum operations for these calculations are optional but preferable. The calculations
carried out by Auxiliary Computation circuit 653 are relatively simple in comparison with the
modular multiplication features. Residue checking for these calculations is also relatively
simple. For the addition operation Z = J + K, the checking mechanism is to make sure that the
value of Z mod (R - 1) is the same as the value of the modulo (R - 1) sum of (J mod (R - 1)) and (K
mod (R - 1)), where R is an even integer. Similarly, to check the operation of Z = J - K, one is to
check if the value of Z mod (R - 1) is the same as the value of the modulo (R - 1) difference of (J
mod (R - 1)) and (K mod (R - 1)). As for the operation of the modular reduction Z = J mod N that is
implemented by a long division, Z is the remainder of J divided by N. One has the expression J
= QN + Z, where Q is the quotient. The error checking for this modular reduction operation can
be carried out by comparing the value of J mod (R - 1) and the modulo (R - 1) sum of (Q mod (R - 1))(N
mod (R - 1)) and (Z mod (R - 1)).
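
The three residue checks for the auxiliary operations can be written out directly. The sketch below is a software illustration with hypothetical function names; residues are taken with Python's `%` operator rather than with end-around carry adders, and each function returns True when the claimed result passes the check.

```python
def residue(x: int, k: int) -> int:
    """x mod (R - 1), with R = 2^k."""
    return x % ((1 << k) - 1)


def check_add(j: int, kk: int, z: int, k: int) -> bool:
    """Check a claimed result z for Z = J + K."""
    return residue(z, k) == (residue(j, k) + residue(kk, k)) % ((1 << k) - 1)


def check_sub(j: int, kk: int, z: int, k: int) -> bool:
    """Check a claimed result z for Z = J - K."""
    return residue(z, k) == (residue(j, k) - residue(kk, k)) % ((1 << k) - 1)


def check_mod(j: int, n: int, q: int, z: int, k: int) -> bool:
    """Check a claimed quotient q and remainder z for J = Q * N + Z."""
    predicted = (residue(q, k) * residue(n, k) + residue(z, k)) % ((1 << k) - 1)
    return residue(j, k) == predicted
```

A correct result always passes; an incorrect result is caught unless the error happens to be a multiple of R - 1.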

While many of the concepts presented above have been couched in terms of what are
seemingly purely mathematical algorithms, the applications involved are really directed to the
encryption, transmission and decryption of messages in whatever form these messages may be
represented, as long as they are in digital form, or its equivalent (octal, binary coded decimal or
hexadecimal). In these methods for encryption, transmission and decryption, messages are
represented by large integers expressed in binary form so that for purposes of explaining the theory,
operation and value of the methods and devices presented herein, the description is necessarily of
a mathematical nature. Nonetheless, the devices and methods described herein provide practical
methods for ensuring secure communications. As such the devices and methods described
herein represent practical implementations of mathematical concepts.

It is also noted that the operation of the circuits described herein is meant to occur over a
repeated number of cycles. The description herein sets forth the ideal number of cycles generally 
required for proper operation in the most general situations. However, neither the specification
nor claims should be interpreted as being limited to the most general cases. In particular, it is
noted that suboptimal control methods can sometimes lead to operation of the circuits for more 
cycles than is absolutely necessary, either by accident or by design. The scope of the claims 
herein should not be so narrowly construed as to forego this inclusion. Likewise, for certain 
input situations, the full number of cycles normally required for the most general cases is not 
required. Accordingly, some of the claims herein recite the operation for at most t cycles. 
Clearly, for its intended use in encryption and decryption, the circuits herein have been designed 
to handle the most general cases. The claims, however, should not be construed to be so narrow 
as to exclude either the simpler cases or the cases of deliberate operation over more than the 
necessary number of cycles. 



Accordingly, from the above, it is seen that all of the objectives indicated are achieved by 
the circuits and processes described herein. In particular, it is seen that there is provided a circuit
and process for carrying out multiplication of relatively large numbers modulo N using either
multiplier and adder arrays or a plurality of nearly identical processing elements. It is also seen
that these same circuits can be used not only to implement modular exponentiation but can also be
employed as part of hardware circuits for implementing solutions to problems based on the
Chinese Remainder Theorem. It is even further noted that the objective of providing pipelined
operations for a series of connected processing elements is achieved in a manner in which all of
the processing elements are functioning at all times to produce desired final or intermediate
results. And it is also seen that circuits are provided for carrying out functions which are
ancillary to the processes described above and, in particular, circuits and processes for producing 
negative multiplicative inverses. While such inverses are providable in a data processing system 
via software or by means of prior (and perhaps separate) computation, the processes and circuits 
shown herein are capable of providing this function in a short period of time with relatively 
simple hardware which takes advantage of already existing circuit registers and other elements. 
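For readers who wish to see what a negative multiplicative inverse is in concrete terms, the following software sketch computes a value s satisfying s·n ≡ -1 (mod 2^k) for an odd n. It is an assumption-laden illustration only: it uses the well-known Newton (Hensel lifting) iteration rather than the register-based hardware procedure of this disclosure, and the function name neg_inverse_mod_pow2 is hypothetical.

```python
def neg_inverse_mod_pow2(n, k):
    """Return s with (s * n + 1) % 2**k == 0, for odd n (illustrative only)."""
    if n % 2 == 0:
        raise ValueError("n must be odd to be invertible modulo a power of two")
    mask = (1 << k) - 1
    x = 1          # n * 1 is congruent to 1 modulo 2, so one bit is correct
    bits = 1
    while bits < k:
        # Newton step: each pass doubles the number of correct low-order bits.
        x = (x * (2 - n * x)) & mask
        bits *= 2
    return (-x) & mask   # negate the ordinary inverse modulo 2**k


# Hypothetical sample value: a 16-bit odd block.
s = neg_inverse_mod_pow2(0xCDEF, 16)
```

The iteration needs only about log2(k) multiply steps, which is one reason this form is popular in software; a hardware realization may proceed quite differently.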

From the above, it is clear that the circuits shown in applicants' figures fulfill all of the 
objects indicated. Additionally, it is noted that the circuit is easy to construct and takes full 
advantage of the parallelism brought about by structuring one of the operands in the
multiplication process as m blocks of k bits each. In particular, it is seen that the circuit shown
herein carries out a two-phase operation, the first phase of which computes X_i and y_i, with the second phase
computing a value for Z_i which eventually, at the last step, becomes a desired result. In
particular, it is seen that the circuit shown in applicants' figures provides a desired trade-off
between multipliers which have to be n bits by n bits in size and serial circuits which
operate with only one bit of a factor being considered at each step.
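The two-phase, block-at-a-time operation summarized above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not a description of the circuit in applicants' figures: it assumes R = 2^k, an odd modulus N, a constant s equal to the negative inverse of N modulo R, and update rules of the usual block-wise Montgomery form for the quantities the text calls X, y and Z. The function name block_modmul is hypothetical, and Python 3.8 or later is assumed for the three-argument pow with a negative exponent.

```python
def block_modmul(A, B, N, k, m):
    """Process A as m blocks of k bits; returns Z congruent to A*B*R**(-m) mod N.

    Assumes N is odd, A < 2**(k*m) and B < N (illustrative sketch only).
    """
    R = 1 << k
    mask = R - 1
    s = (-pow(N & mask, -1, R)) % R      # s * N is congruent to -1 modulo R
    Z = 0
    for i in range(m):
        A_i = (A >> (k * i)) & mask      # i-th k-bit block of operand A
        # Phase 1: form X and the k-bit multiplier y from its low-order block.
        X = Z + A_i * B
        y = (s * (X & mask)) & mask
        # Phase 2: X + y*N is divisible by R, so the shift is an exact division.
        Z = (X + y * N) >> k
    return Z


# Hypothetical sample operands: a 64-bit odd modulus handled as 4 blocks of 16 bits.
N = 0xF123456789ABCDEF
A = 0x0123456789ABCDEF
B = 0x0FEDCBA987654321
Z = block_modmul(A, B, N, k=16, m=4)
```

Under these assumptions each step needs only a k-bit by n-bit multiplication, which reflects the trade-off between a full n-bit by n-bit multiplier and a purely bit-serial circuit. The returned Z carries an extra factor of R^(-m) modulo N and may exceed N by less than N, so a caller would typically compensate for the factor and apply a final conditional subtraction.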

While the invention has been described in detail herein in accordance with certain preferred 
embodiments thereof, many modifications and changes therein may be effected by those skilled 
in the art. Accordingly, it is intended by the appended claims to cover all such modifications and 
changes as fall within the true spirit and scope of the invention. 